# Series

*A Series is a one-dimensional array whose values all share a single data type*

In [4]:
import pandas as pd 

serie = pd.Series([1, 2, 3])

serie

0    1
1    2
2    3
dtype: int64
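
As a small illustration of the single-dtype rule, the sketch below builds three Series and prints the dtype pandas picks for each. The variable names are only for this example.

```python
import pandas as pd

# All values are integers, so the Series keeps an integer dtype.
ints = pd.Series([1, 2, 3])

# Mixing integers and floats upcasts every value to float.
mixed_numbers = pd.Series([1, 2.5, 3])

# Mixing numbers and strings falls back to the generic object dtype.
mixed_types = pd.Series([1, "two", 3.0])

print(ints.dtype, mixed_numbers.dtype, mixed_types.dtype)
```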

# DataFrames

*A DataFrame is a collection of Series, essentially a matrix (a two-dimensional table)*

### DataFrame from a list of lists

Unlike a DataFrame built from a dictionary, a list of lists has no keys to serve as column names,
but we can still set the column names and the index labels by passing them as
parameters to the DataFrame constructor

In [5]:
df_1 = pd.DataFrame([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]],
                    columns=[2000, 2001, 2002],
                    index=['Level A', 'Level B', 'Level C'])

df_1

Unnamed: 0,2000,2001,2002
Level A,1,2,3
Level B,4,5,6
Level C,7,8,9
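
To make the link between a DataFrame and its Series concrete, here is a minimal sketch that rebuilds the same table and pulls out one column and one row. Both selections come back as Series.

```python
import pandas as pd

df_1 = pd.DataFrame([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]],
                    columns=[2000, 2001, 2002],
                    index=['Level A', 'Level B', 'Level C'])

# Selecting one column returns a Series that shares the DataFrame's index.
column_2001 = df_1[2001]
print(type(column_2001).__name__)   # Series
print(column_2001.tolist())         # [2, 5, 8]

# Selecting one row by label also returns a Series, indexed by column name.
row_b = df_1.loc['Level B']
print(row_b.tolist())               # [4, 5, 6]
```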


### DataFrame from a dictionary

Defining a DataFrame from a dictionary is simpler because the keys become the column names,
but the index labels still cannot be set through the dictionary itself. To set them we pass an **index** parameter
with the list of index labels

In [6]:
df_2 = pd.DataFrame({2020: [100, 5410, 6630],
                     2021: [5585, 6654, 8715],
                     2022: [569, 5554, 8123]},
                    index=['January', 'February', 'March'])

df_2

Unnamed: 0,2020,2021,2022
January,100,5585,569
February,5410,6654,5554
March,6630,8715,8123
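
The effect of the **index** parameter can be sketched by building the same data twice, once without it and once with it. The names `data`, `default_index` and `labeled` are only for this example.

```python
import pandas as pd

data = {2020: [100, 5410, 6630],
        2021: [5585, 6654, 8715],
        2022: [569, 5554, 8123]}

# Without index, pandas numbers the rows 0, 1, 2.
default_index = pd.DataFrame(data)
print(list(default_index.index))      # [0, 1, 2]

# With index, the rows get the labels we pass.
labeled = pd.DataFrame(data, index=['January', 'February', 'March'])
print(labeled.loc['February', 2021])  # 6654
```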


# Reading Data

*Reading a dataset with pandas*

In [7]:
# This option sets the maximum number of rows to display
pd.options.display.max_rows = 15

abc_news = pd.read_csv('abc_news.csv')

abc_news

# These show the first and last rows, respectively
#abc_news.head()
#abc_news.tail()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers
...,...,...
94,20030219,mayor warns landfill protesters
95,20030219,meeting to consider tick clearance costs
96,20030219,meeting to focus on broken hill water woes
97,20030219,moderate lift in wages growth
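
For readers without the CSV file at hand, the sketch below reads a few of the rows shown above from an in-memory string instead of from disk. The `csv_text` stand-in and the `sample` name are assumptions for this example; `pd.read_csv` accepts a file path or a file-like object in the same way.

```python
import io
import pandas as pd

# Hypothetical stand-in for abc_news.csv, using a few rows shown above.
csv_text = """publish_date,headline_text
20030219,act fire witnesses must be aware of defamation
20030219,a g calls for infrastructure protection summit
20030219,air nz staff in aust strike for pay rise
20030219,air nz strike to affect australian travellers
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)      # (4, 2)
print(sample.head(2))    # first two rows
print(sample.tail(1))    # last row
```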


# Index and selection

### Dictionary like

*Selecting a single column by name returns a Series, not a DataFrame, no matter how many columns the dataset has*

In [14]:
abc_news['publish_date']

0     20030219
1     20030219
2     20030219
3     20030219
4     20030219
        ...   
94    20030219
95    20030219
96    20030219
97    20030219
98    20030219
Name: publish_date, Length: 99, dtype: int64
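
The difference between single and double brackets can be sketched on a tiny hypothetical sample with the same two columns:

```python
import pandas as pd

# Hypothetical two-row sample with the same columns as abc_news.
sample = pd.DataFrame({
    'publish_date': [20030219, 20030219],
    'headline_text': ['bathhouse plans move ahead',
                      'moderate lift in wages growth'],
})

as_series = sample['publish_date']      # single name -> Series
as_frame = sample[['publish_date']]     # list of names -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```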

### NumPy like

*In this example we select the rows at positions 10 through 14; the end position of an iloc slice (15) is excluded*

In [15]:
abc_news.iloc[10:15]

Unnamed: 0,publish_date,headline_text
10,20030219,australia to contribute 10 million in aid to iraq
11,20030219,barca take record as robson celebrates birthda...
12,20030219,bathhouse plans move ahead
13,20030219,big hopes for launceston cycling championship
14,20030219,big plan to boost paroo water supplies


In [17]:
abc_news.iloc[15]['headline_text']

'blizzard buries united states in bills'

In [16]:
abc_news.iloc[:5, 0]

0    20030219
1    20030219
2    20030219
3    20030219
4    20030219
Name: publish_date, dtype: int64
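
The three `iloc` patterns used in this section (a row slice, a single row, and a row slice plus a column position) can be tried on a small hypothetical sample like this:

```python
import pandas as pd

# Hypothetical four-row sample built from headlines shown earlier.
sample = pd.DataFrame({
    'publish_date': [20030219] * 4,
    'headline_text': ['act fire witnesses must be aware of defamation',
                      'a g calls for infrastructure protection summit',
                      'air nz staff in aust strike for pay rise',
                      'air nz strike to affect australian travellers'],
})

rows = sample.iloc[1:3]          # positions 1 and 2; position 3 is excluded
one_row = sample.iloc[0]         # a single row comes back as a Series
first_col = sample.iloc[:2, 0]   # first two rows of the first column

print(len(rows))                   # 2
print(one_row['headline_text'])    # act fire witnesses must be aware of defamation
print(first_col.tolist())          # [20030219, 20030219]
```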

### Label based

*Label-based selection is usually the most readable way to get data from a dataset.*
*As the commented-out line below shows, columns can be accessed directly by name*
*instead of by the integer positions used in NumPy-like selection*

In [18]:
# abc_news.loc[:, 'publish_date': 'headline_text']
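
A minimal sketch of label-based selection follows. The row labels `'a'`, `'b'`, `'c'` are hypothetical and only there to show that `loc` slices include both endpoints, while `iloc` slices exclude the end position.

```python
import pandas as pd

sample = pd.DataFrame(
    {'publish_date': [20030219, 20030219, 20030219],
     'headline_text': ['bathhouse plans move ahead',
                       'mayor warns landfill protesters',
                       'moderate lift in wages growth']},
    index=['a', 'b', 'c'],   # hypothetical row labels
)

# Label slices include both endpoints, unlike iloc slices.
by_label = sample.loc['a':'b', 'publish_date':'headline_text']
by_position = sample.iloc[0:1]

print(by_label.shape)     # (2, 2)
print(by_position.shape)  # (1, 2)
```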

# Data Wrangling

### 1. Add a column that identifies the source of each row

In [20]:
abc_news['uid'] = 'ABC Australia'

abc_news

Unnamed: 0,publish_date,headline_text,uid
0,20030219,aba decides against community broadcasting lic...,ABC Australia
1,20030219,act fire witnesses must be aware of defamation,ABC Australia
2,20030219,a g calls for infrastructure protection summit,ABC Australia
3,20030219,air nz staff in aust strike for pay rise,ABC Australia
4,20030219,air nz strike to affect australian travellers,ABC Australia
...,...,...,...
94,20030219,mayor warns landfill protesters,ABC Australia
95,20030219,meeting to consider tick clearance costs,ABC Australia
96,20030219,meeting to focus on broken hill water woes,ABC Australia
97,20030219,moderate lift in wages growth,ABC Australia
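
Assigning a single value to a new column, as in the cell above, repeats that value on every row. The sketch below shows this on a hypothetical two-row sample, next to a list assignment that needs one value per row (`row_number` is a made-up column for illustration).

```python
import pandas as pd

sample = pd.DataFrame({'headline_text': ['bathhouse plans move ahead',
                                         'moderate lift in wages growth']})

# A scalar is broadcast to every row.
sample['uid'] = 'ABC Australia'

# A list must have exactly one value per row.
sample['row_number'] = [0, 1]   # hypothetical extra column

print(sample['uid'].tolist())   # ['ABC Australia', 'ABC Australia']
```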


# Additional cleanup

### 1. Add a unique uid for each article

In [23]:
import hashlib 

# Remember, for DataFrame.apply:
# axis = 0 applies the function to each column
# axis = 1 applies the function to each row
uids = (abc_news.apply(lambda row: hashlib.md5(row['headline_text'].encode()), axis = 1)
                .apply(lambda hash_object: hash_object.hexdigest())
           )

abc_news['uid'] = uids 
abc_news.set_index('uid')

uid,publish_date,headline_text
46c9d3509b24e0900d23c6f1d6a2808e,20030219,aba decides against community broadcasting lic...
7df0205cf79968bff5055d4746e06ccb,20030219,act fire witnesses must be aware of defamation
4d5b7fd2fd72c7fbb5508ec6da59f34d,20030219,a g calls for infrastructure protection summit
f6c67d2f6df969c1786b8034bb420d62,20030219,air nz staff in aust strike for pay rise
c3c0e31e9f37a0f0db23b9944d28d967,20030219,air nz strike to affect australian travellers
...,...,...
755f52a3db4ee713d9aae5c36f292043,20030219,mayor warns landfill protesters
9c391bf7c75effe7df478e5aa4f2cef7,20030219,meeting to consider tick clearance costs
d091eb9e7c9550854c5135bc9a3566fe,20030219,meeting to focus on broken hill water woes
f315b7aee26d2ec92162525af9b48cb0,20030219,moderate lift in wages growth
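
As a possible simplification, the same kind of uid can be computed by applying a helper to the `headline_text` column alone, which avoids the row-wise `apply`. This is a sketch on a hypothetical two-row sample; `md5_hex` is a helper introduced here, not part of the notebook. Note that `set_index` returns a new DataFrame, so the result has to be assigned to keep it, and that identical headlines would produce identical uids.

```python
import hashlib
import pandas as pd

sample = pd.DataFrame({'headline_text': ['bathhouse plans move ahead',
                                         'moderate lift in wages growth']})

def md5_hex(text):
    """Return the MD5 hex digest of a string (an identifier, not a security measure)."""
    return hashlib.md5(text.encode()).hexdigest()

# Applying to one column avoids the row-wise apply with axis=1.
sample['uid'] = sample['headline_text'].apply(md5_hex)

# set_index returns a new DataFrame; assign it to keep the result.
indexed = sample.set_index('uid')

print(indexed.index.name)          # uid
print(len(indexed.index[0]))       # 32
```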


### 2. Clean unexpected chars
*My DataFrame has no unexpected characters, but many datasets do. The code below has no effect on this DataFrame; it is included to show how stray newline characters can be removed*

In [31]:
stripped_body = (abc_news
                 .apply(lambda row: row['headline_text'], axis=1)
                 .apply(lambda headline_text: list(headline_text))
                 .apply(lambda letters: list(map(lambda letter: letter.replace('\n', ''), letters)))
                 .apply(lambda letters: ''.join(letters))
                 )
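
A shorter way to get the same effect, sketched here as an alternative rather than as the notebook's method, is the vectorized `Series.str.replace`. The headlines below have newlines inserted on purpose for illustration.

```python
import pandas as pd

# Hypothetical headlines with stray newlines added for illustration.
headlines = pd.Series(['bathhouse plans\n move ahead',
                       'moderate lift in wages growth\n'])

# regex=False makes pandas treat the pattern as a literal substring.
cleaned = headlines.str.replace('\n', '', regex=False)

print(cleaned.tolist())
# ['bathhouse plans move ahead', 'moderate lift in wages growth']
```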

# Data enrichment

*This section tokenizes the title of each article, that is, splits it into individual words, and stores the number of alphabetic, non-stop-word tokens in a new column*

In [8]:
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

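def tokenize_columns(df, column_name):
    # Returns, for each row, the number of alphabetic non-stop-word tokens in column_name
    return (df
            .dropna()
            # Split the text into tokens
            .apply(lambda row: nltk.word_tokenize(row[column_name]), axis=1)
            # Keep only purely alphabetic tokens
            .apply(lambda tokens: list(filter(lambda token: token.isalpha(), tokens)))
            # Lowercase so tokens match the stop-word list
            .apply(lambda tokens: list(map(lambda token: token.lower(), tokens)))
            # Drop stop words
            .apply(lambda word_list: list(filter(lambda word: word not in stop_words, word_list)))
            # Count what is left
            .apply(lambda valid_word_list: len(valid_word_list))
            )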

abc_news['n_tokens_title'] = tokenize_columns(abc_news, 'headline_text')

abc_news

[nltk_data] Downloading package punkt to /home/miqueas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/miqueas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,publish_date,headline_text,n_tokens_title
0,20030219,aba decides against community broadcasting lic...,5
1,20030219,act fire witnesses must be aware of defamation,6
2,20030219,a g calls for infrastructure protection summit,5
3,20030219,air nz staff in aust strike for pay rise,7
4,20030219,air nz strike to affect australian travellers,6
...,...,...,...
94,20030219,mayor warns landfill protesters,4
95,20030219,meeting to consider tick clearance costs,5
96,20030219,meeting to focus on broken hill water woes,6
97,20030219,moderate lift in wages growth,4
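
The counting logic can also be sketched without NLTK, which helps when the downloads are not available. The version below swaps in plain `str.split` for `nltk.word_tokenize` and a deliberately tiny, hypothetical stop-word set for NLTK's English list, so it is an approximation of the pipeline above rather than a replacement; `count_tokens` and `STOP_WORDS` are names made up for this example.

```python
import pandas as pd

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {'a', 'be', 'of', 'to', 'in', 'for', 'against'}

def count_tokens(text):
    """Count alphabetic, non-stop-word tokens in a whitespace-separated string."""
    tokens = text.lower().split()
    words = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return len(words)

headlines = pd.Series([
    'act fire witnesses must be aware of defamation',
    'a g calls for infrastructure protection summit',
    'air nz staff in aust strike for pay rise',
    'air nz strike to affect australian travellers',
])

n_tokens = headlines.apply(count_tokens)
print(n_tokens.tolist())   # [6, 5, 7, 6]
```

On these four headlines the counts agree with the `n_tokens_title` values shown in the table above.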
