# Notebook for developing the transform step for data retrieved from NewsAPI

## Possible data transformation operations:
- Cleaning (removing unused features, duplicates, and outliers)
- Reformatting (normalizing data that comes from different sources: date formats, currencies, etc.)
- Feature extraction (creating new features from existing ones)
- Aggregation (computing the required metrics)
- Merging (combining data from multiple sources)
- Filtering (excluding unneeded categories from the dataset)
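
As a rough illustration of three of the operations listed above (reformatting, feature extraction, aggregation), here is a minimal sketch on a toy DataFrame. The rows are made up and only mimic the NewsAPI column names; `published_date`, `content_length`, and `articles_per_day` are hypothetical names introduced for this example.

```python
import pandas as pd

# Toy rows that mimic the NewsAPI columns (values are made up for illustration).
toy = pd.DataFrame(
    {
        "title": ["A", "B"],
        "publishedAt": ["2025-03-04T18:29:39Z", "2025-03-05T13:10:16Z"],
        "content": ["short text", "a somewhat longer text"],
    }
)

# Reformatting: parse ISO 8601 strings into timezone-aware timestamps.
toy["publishedAt"] = pd.to_datetime(toy["publishedAt"], utc=True)

# Feature extraction: derive new columns from existing ones.
toy["published_date"] = toy["publishedAt"].dt.date
toy["content_length"] = toy["content"].str.len()

# Aggregation: number of articles per calendar day.
articles_per_day = toy.groupby("published_date").size()
print(articles_per_day)
```

Parsing `publishedAt` early makes any later date-based filtering or grouping straightforward, since the column is otherwise loaded as plain strings.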

In [93]:
import numpy as np
import pandas as pd

In [94]:
data = pd.read_csv("./data_apple.csv", index_col=0)

In [95]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 494 entries, 0 to 493
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       478 non-null    object
 1   title        494 non-null    object
 2   description  486 non-null    object
 3   url          494 non-null    object
 4   urlToImage   486 non-null    object
 5   publishedAt  494 non-null    object
 6   content      494 non-null    object
 7   source.id    29 non-null     object
 8   source.name  494 non-null    object
dtypes: object(9)
memory usage: 38.6+ KB


In [96]:
data.isnull().sum()

author          16
title            0
description      8
url              0
urlToImage       8
publishedAt      0
content          0
source.id      465
source.name      0
dtype: int64

In [97]:
data.columns

Index(['author', 'title', 'description', 'url', 'urlToImage', 'publishedAt',
       'content', 'source.id', 'source.name'],
      dtype='object')

In [98]:
data.head(3)

Unnamed: 0,author,title,description,url,urlToImage,publishedAt,content,source.id,source.name
0,Emma Roth,Apple reportedly challenges the UK’s secretive...,Apple is pushing back against the UK’s secret ...,https://www.theverge.com/news/623977/apple-uk-...,https://platform.theverge.com/wp-content/uploa...,2025-03-04T18:29:39Z,"Apple is appealing the UKs backdoor order, acc...",the-verge,The Verge
1,Brenda Stolyar,"Apple 11-inch and 13-inch iPad Air : Price, Sp...",The 11-inch and 13-inch tablets have the same ...,https://www.wired.com/story/apple-new-ipad-air...,https://media.wired.com/photos/67c71913d63ae42...,2025-03-04T15:46:35Z,Less than a year after upgrading its iPad Air ...,wired,Wired
2,Brittany Vincent,Apple AirTag 4-Pack Drops to Below $70 on Amaz...,Why buy one? Apple AirTag 4-Pack is a way bett...,https://gizmodo.com/apple-airtag-4-pack-drops-...,https://gizmodo.com/app/uploads/2025/02/4airta...,2025-03-05T13:10:16Z,Looking to stop losing your stuff? Apple’s Air...,,Gizmodo.com


## 2. Preprocessing

### 2.1 Handling missing values

In [99]:
data.dropna(subset=["title", "content"], inplace=True)

In [100]:
data[data["author"].isnull()].head()

Unnamed: 0,author,title,description,url,urlToImage,publishedAt,content,source.id,source.name
14,,"Apple introduces new iPad Air with M3 chip, Ap...",,https://consent.yahoo.com/v2/collectConsent?se...,,2025-03-04T15:42:05Z,"If you click 'Accept all', we and our partners...",,Yahoo Entertainment
38,,"Novo Nordisk's Wegovy, Apple Air, Palantir: To...",,https://consent.yahoo.com/v2/collectConsent?se...,,2025-03-05T16:37:35Z,"If you click 'Accept all', we and our partners...",,Yahoo Entertainment
46,,Apple M3 Ultra,"Apple today announced M3 Ultra, offering the m...",https://www.apple.com/newsroom/2025/03/apple-r...,https://www.apple.com/newsroom/images/2025/03/...,2025-03-05T13:59:50Z,"March 5, 2025\r\nPRESS RELEASE\r\nApple reveal...",,Apple Newsroom
49,,Apple introduces iPad Air with powerful M3 chi...,"Apple today introduced the new iPad Air, power...",https://www.apple.com/newsroom/2025/03/apple-i...,https://www.apple.com/newsroom/images/2025/03/...,2025-03-04T14:02:18Z,"March 4, 2025\r\nPRESS RELEASE\r\nApple introd...",,Apple Newsroom
88,,"Apple unveils new Mac Studio, the most powerfu...","Apple today announced the new Mac Studio, the ...",https://www.apple.com/newsroom/2025/03/apple-u...,https://www.apple.com/newsroom/images/2025/03/...,2025-03-05T14:00:57Z,"March 5, 2025\r\nPRESS RELEASE\r\nApple unveil...",,Apple Newsroom


In [101]:
data.dropna(subset=["author"], inplace=True)

I considered replacing the empty author values with Unknown, but as the output above shows, these records do not carry useful information, so we drop them.
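
For comparison, the options discussed above can be sketched on toy data. Option A is what this notebook does and option B is the rejected alternative. Option C is a hypothetical middle ground that is not part of the notebook: it assumes the consent-page rows can be recognized by the prefix stored in `CONSENT_PREFIX`, a name and value chosen here for illustration.

```python
import pandas as pd

# Toy rows modeled on the records shown above (values shortened).
toy = pd.DataFrame(
    {
        "author": ["Emma Roth", None, None],
        "content": [
            "Apple is appealing the order.",
            "If you click 'Accept all', we and our partners...",
            "PRESS RELEASE Apple reveals...",
        ],
    }
)

# Option A (used in this notebook): drop every row without an author.
dropped = toy.dropna(subset=["author"])

# Option B (the rejected alternative): keep the rows and label the author.
filled = toy.fillna({"author": "Unknown"})

# Option C (hypothetical): drop only consent-page boilerplate, identified
# by an assumed prefix, and label the authors of the remaining rows.
CONSENT_PREFIX = "If you click 'Accept all'"
is_consent = toy["content"].str.startswith(CONSENT_PREFIX)
selective = toy[~is_consent].fillna({"author": "Unknown"})

print(len(dropped), len(filled), len(selective))
```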

### 2.2 Duplicates

In [102]:
data.duplicated(subset=["title", "content"]).any()

True

Duplicates are present and need to be removed.

In [103]:
data[data.duplicated(subset=["title", "content"])]

Unnamed: 0,author,title,description,url,urlToImage,publishedAt,content,source.id,source.name
295,Nathan Le Gohlisse,iPhone et iPad : la honte des versions 64 Go e...,Apple a officiellement tiré un trait sur les m...,https://www.frandroid.com/marques/apple/253033...,https://c0.lestechnophiles.com/images.frandroi...,2025-03-05T10:47:41Z,Apple a officiellement tiré un trait sur les m...,,Frandroid


In [104]:
data.drop_duplicates(subset=["title", "content"], inplace=True)
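
A small sketch of how the `keep` parameter affects the duplicate handling used above, on made-up rows. With the default `keep="first"`, `duplicated` flags only the later copies, which is why a single row appeared in the output above. Passing `keep=False` flags every member of a duplicate group, which can help when inspecting how the copies differ in other columns.

```python
import pandas as pd

# Toy rows: the first two share title and content but differ in url.
toy = pd.DataFrame(
    {
        "title": ["iPad Air", "iPad Air", "Mac Studio"],
        "content": ["same text", "same text", "other text"],
        "url": [
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/c",
        ],
    }
)

key = ["title", "content"]

# keep=False marks all copies in a group, not only the later ones.
all_copies = toy[toy.duplicated(subset=key, keep=False)]

# keep="first" (the default) retains the first occurrence of each group.
deduplicated = toy.drop_duplicates(subset=key, keep="first")

print(all_copies)
print(deduplicated)
```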

### 2.3 Filtering

In [None]:
data.drop(columns=["urlToImage", "source.id", "source.name"], inplace=True)
data.head(3)

Unnamed: 0,author,title,description,url,content
0,Emma Roth,Apple reportedly challenges the UK’s secretive...,Apple is pushing back against the UK’s secret ...,https://www.theverge.com/news/623977/apple-uk-...,"Apple is appealing the UKs backdoor order, acc..."
1,Brenda Stolyar,"Apple 11-inch and 13-inch iPad Air : Price, Sp...",The 11-inch and 13-inch tablets have the same ...,https://www.wired.com/story/apple-new-ipad-air...,Less than a year after upgrading its iPad Air ...
2,Brittany Vincent,Apple AirTag 4-Pack Drops to Below $70 on Amaz...,Why buy one? Apple AirTag 4-Pack is a way bett...,https://gizmodo.com/apple-airtag-4-pack-drops-...,Looking to stop losing your stuff? Apple’s Air...


# Look into how the sentiment analysis task is usually solved, then clean the text accordingly (remove extraneous characters, normalize case, remove stop words)
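
The cleaning steps named above could look roughly like the sketch below. It is not a finished pipeline: `clean_text` and `STOP_WORDS` are hypothetical names, the stop-word list is deliberately tiny, and the regular expression keeps only ASCII letters, which assumes English text (accented characters in non-English articles would be lost).

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and"}


def clean_text(text: str) -> list[str]:
    """Lowercase the text, strip non-letter characters, drop stop words."""
    lowered = text.lower()
    # Replace everything except ASCII letters and whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [tok for tok in letters_only.split() if tok not in STOP_WORDS]


sample = "Apple is appealing the UKs backdoor order, according to the Financial Times."
tokens = clean_text(sample)
print(tokens)
```

How aggressive the cleaning should be depends on the sentiment method chosen later: some approaches rely on punctuation and capitalization, so this kind of normalization may suit word counting better than sentiment scoring.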

## Tasks to solve:
- Counting the most frequent words
- Sentiment analysis
- Identifying the main topics

### Counting the most frequent words

In [106]:
data.loc[0].content

'Apple is appealing the UKs backdoor order, according to the Financial Times.\r\nApple is appealing the UKs backdoor order, according to the Financial Times.\r\nApple is pushing back against the UKs secre… [+1157 chars]'
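
The value above shows two quirks that would distort word counts: a trailing `… [+1157 chars]` truncation marker and a verbatim repeated sentence. Below is a minimal sketch of one way to handle both before counting, using the string from the output above. The regular expression and the line-level deduplication are assumptions made for this example. The cut-off last word (`secre`) is still counted as a token here.

```python
import re
from collections import Counter

raw = (
    "Apple is appealing the UKs backdoor order, according to the Financial Times.\r\n"
    "Apple is appealing the UKs backdoor order, according to the Financial Times.\r\n"
    "Apple is pushing back against the UKs secre… [+1157 chars]"
)

# Remove the trailing "… [+N chars]" truncation marker.
without_marker = re.sub(r"…?\s*\[\+\d+ chars\]$", "", raw)

# Drop verbatim repeated lines while preserving their order.
unique_lines = list(dict.fromkeys(without_marker.splitlines()))

# Tokenize into lowercase ASCII words and count them.
words = re.findall(r"[a-z]+", " ".join(unique_lines).lower())
word_counts = Counter(words)
top_words = word_counts.most_common(3)
print(top_words)
```

Without stop-word removal the most frequent token is `the`, which illustrates why the counting step should come after the text cleaning.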