<a href="https://colab.research.google.com/github/Dr-Carlos-Villasenor/TRSeminar/blob/main/TRS01_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Seminar
## Dr. Carlos Vilaseñor
## Pandas library

*Pandas* is a library built on top of *NumPy* that integrates with *Matplotlib* for plotting. It is used for data handling and analysis in *Python*.

In [2]:
import pandas as pd
import numpy as np

## Main objects

There are two principal objects in *pandas*: the **Series** object and the **DataFrame** object.

### Series Object

The Series object is a one-dimensional array, similar to a *NumPy* ndarray, whose elements carry labels (the index). For example:


In [None]:
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
print(s)
print('element access: ', s['a'])
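
As a minimal sketch of what the labels add, the snippet below rebuilds the same Series and accesses it by label (`loc`), by position (`iloc`), and with a boolean mask. The variable names are only illustrative.

```python
import pandas as pd

# Same Series as above, rebuilt so this snippet runs on its own
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

by_label = s.loc['c']      # label-based access
by_position = s.iloc[2]    # position-based access (0-indexed)
subset = s[['a', 'd']]     # selecting several labels returns a Series
positives = s[s > 0]       # a boolean mask keeps only the positive values
doubled = s * 2            # arithmetic is applied element-wise

print(by_label, by_position)
print(subset)
print(positives)
print(doubled)
```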

### DataFrame object

The DataFrame object is the main object in the library. It represents a heterogeneous two-dimensional table with labeled columns (very similar to data frames in the *R* programming language). Here is an example:


In [None]:
data = {'Nombre': ['Carlos', 'Julia', 'Fabiola', 'Ernesto'],
        'edad': [28, 25, 56, 21],
        'calificación': [100, 89, 48, 75]}

# df is a common name for DataFrames
df = pd.DataFrame(data)
print(df)
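
To see what "heterogeneous" means in practice, the following sketch rebuilds the same DataFrame and inspects the type of each column. `select_dtypes` is used here only as one convenient way to keep the numeric columns.

```python
import pandas as pd

data = {'Nombre': ['Carlos', 'Julia', 'Fabiola', 'Ernesto'],
        'edad': [28, 25, 56, 21],
        'calificación': [100, 89, 48, 75]}
df = pd.DataFrame(data)

# Each column has its own dtype: text for 'Nombre', integers for the others
print(df.dtypes)
print(df.shape)           # (rows, columns)
print(list(df.columns))   # the header

# Keep only the numeric columns
numeric = df.select_dtypes(include='number')
print(numeric)
```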

We can access each element of the DataFrame by integer position, as in a matrix (`iat`), or by row and column labels (`at`).

In [None]:
print(df.iat[0,0])
print(df.iat[1,2])
print(df.at[2,'Nombre'])
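
`at` and `iat` return a single cell. For whole rows, columns, or blocks, `loc` (labels) and `iloc` (positions) are the usual counterparts. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Nombre': ['Carlos', 'Julia', 'Fabiola', 'Ernesto'],
                   'edad': [28, 25, 56, 21],
                   'calificación': [100, 89, 48, 75]})

row = df.loc[1]               # one row, returned as a Series
cell = df.loc[2, 'Nombre']    # one cell by labels, like df.at
block = df.iloc[0:2, 0:2]     # first two rows and first two columns
ages = df.loc[:, 'edad']      # a whole column by label

print(row)
print(cell)
print(block)
print(ages)
```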

DataFrames and Series are mutable data structures.

In [None]:
s['b'] = 5
print('s = \n', s)
df.at[0,'Nombre'] = 'Charlie'
print('df = \n', df)
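
Mutability also covers adding columns and overwriting several cells at once. The sketch below assumes a hypothetical `passed` column and a passing threshold of 60, neither of which comes from the original data. It also keeps a copy to show that `copy()` is unaffected by later changes.

```python
import pandas as pd

df = pd.DataFrame({'Nombre': ['Carlos', 'Julia', 'Fabiola', 'Ernesto'],
                   'edad': [28, 25, 56, 21],
                   'calificación': [100, 89, 48, 75]})

backup = df.copy()   # independent copy, not touched by the edits below

# Add a new column (hypothetical name and threshold, for illustration only)
df['passed'] = df['calificación'] >= 60

# Overwrite the cells selected by a boolean mask
df.loc[df['edad'] > 50, 'calificación'] = 50

print(df)
print(backup)
```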

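We can also delete entries from a Series and rows or columns from a DataFrame. Note that `drop` returns a new object, so the result must be assigned back.

In [None]:
# drop the element labeled 'a'
s = s.drop(['a'])
print(s)
# drop a column (axis=1)
df = df.drop('calificación', axis=1)
print(df)
# remove rows by filtering
df = df[df.Nombre != 'Charlie']
print(df)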
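
The following sketch makes the behavior of `drop` explicit: the original object is left unchanged and a new one is returned. The `index=` and `columns=` keywords are an alternative to `axis` that some readers find clearer.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({'Nombre': ['Carlos', 'Julia', 'Fabiola', 'Ernesto'],
                   'edad': [28, 25, 56, 21],
                   'calificación': [100, 89, 48, 75]})

s_without_a = s.drop(['a'])                 # new Series; s still has 'a'
no_grade = df.drop(columns='calificación')  # same as axis=1
no_first_row = df.drop(index=[0])           # drop a row by its index label

print(s)
print(s_without_a)
print(no_grade)
print(no_first_row)
```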

## Data exploration

Working with such a small DataFrame is not very interesting. Let's load a dataset from a CSV file.


In [None]:
!wget 'https://raw.githubusercontent.com/Dr-Carlos-Villasenor/TRSeminar/main/Datasets/countries.csv'
df = pd.read_csv('countries.csv')
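
`read_csv` also accepts any file-like object, which is handy for trying it out without downloading anything. The sketch below parses a tiny in-memory CSV. Its rows are invented placeholders (countries `A` and `B`), not values from `countries.csv`.

```python
import io
import pandas as pd

# Toy CSV text with invented placeholder values; the last row has a missing lifeExp
csv_text = """country,year,lifeExp
A,2000,70.5
A,2005,71.0
B,2000,65.2
B,2005,
"""

toy = pd.read_csv(io.StringIO(csv_text))
print(toy.dtypes)          # column types are inferred from the text
print(toy.isna().sum())    # missing values per column

# Read only some of the columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=['country', 'lifeExp'])
print(subset)
```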

For a new dataset, it is always a good idea to explore the data before training any ML model.

In [None]:
# We print the first 5 records
df.head()

In [None]:
# We print the last 5 records
df.tail()

In [None]:
# Shape of the DataFrame
print(df.shape)

In [None]:
# Basic information of the DataFrame
df.info()

In [None]:
# Column names
df.columns

In [None]:
# Some stats of the numerical features
df.describe()

In [None]:
# Extract a NumPy array from the DataFrame
df.values
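
The pandas documentation recommends `to_numpy()` over `.values` for this conversion. The sketch below, on an invented toy frame, shows that mixing text and numbers produces an array of generic Python objects, while selecting only numeric columns produces a numeric array.

```python
import numpy as np
import pandas as pd

# Toy frame with invented placeholder values
toy = pd.DataFrame({'country': ['A', 'B'],
                    'year': [2000, 2005],
                    'lifeExp': [70.5, 65.2]})

mixed = toy.to_numpy()                         # text + numbers -> object array
numeric = toy[['year', 'lifeExp']].to_numpy()  # ints + floats -> float array

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```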

In [None]:
# max and min of every column that supports comparison
print(df.max())
print(df.min())
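
When only the numeric columns matter, `numeric_only=True` skips the text columns, and `agg` computes several statistics in one call. A sketch on an invented toy frame:

```python
import pandas as pd

# Toy frame with invented placeholder values
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'year': [2000, 2005, 2010],
                    'lifeExp': [70.5, 65.2, 80.1]})

numeric_max = toy.max(numeric_only=True)   # ignores the 'country' column
summary = toy[['year', 'lifeExp']].agg(['min', 'max', 'mean'])

print(numeric_max)
print(summary)
```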

In [None]:
# rename a column
df = df.rename(columns={'gdpPercap':'gdp'})

# a different way to do it
#df.rename(columns={'gdpPercap':'gdp'}, inplace=True)
df.head()

## Selection, replace, filter, sorting

In [None]:
# Return a column as a Series
s1 = df['country']
s2 = df.country
print(s1)
print(s2)

In [None]:
# Return a column as a DataFrame
df2 = df[['country']]
df2
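
Attribute access (`df.country`) is a shortcut that only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The sketch below uses an invented toy frame with a column called `pop`, which collides with the `DataFrame.pop` method.

```python
import pandas as pd

# Toy frame with invented placeholder values
toy = pd.DataFrame({'country': ['A', 'B'], 'pop': [1, 2]})

as_series = toy['country']     # single brackets -> Series
as_frame = toy[['country']]    # double brackets -> DataFrame with one column
same = toy.country.equals(toy['country'])   # the attribute shortcut works here

# 'pop' is also a DataFrame method, so toy.pop is the method, not the column
pop_attr_is_method = callable(toy.pop)
pop_column = toy['pop']        # brackets always return the column

print(type(as_series).__name__, type(as_frame).__name__)
print(same, pop_attr_is_method)
print(pop_column)
```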

In [None]:
# Replace every occurrence of a value in the DataFrame
df2 = df.replace(1952,'one')
df2.head()

In [None]:
# filtering by column
df[df.country == 'Mexico']

In [None]:
# Filtering on multiple conditions
df[(df.country == 'Mexico')&(df.year >= 1977)&(df.lifeExp < 70)]
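
The same kind of filter can be written in a few equivalent ways. The sketch below, on an invented toy frame, compares an explicit boolean mask with `query`, and uses `isin` together with `~` (negation) for membership tests.

```python
import pandas as pd

# Toy frame with invented placeholder values
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B'],
                    'year': [1972, 1982, 1972, 1982],
                    'lifeExp': [60.0, 68.0, 72.0, 75.0]})

# Each condition needs its own parentheses because & binds tighter than ==
mask = (toy['country'] == 'A') & (toy['year'] >= 1977) & (toy['lifeExp'] < 70)
by_mask = toy[mask]

# The same filter written as a query string
by_query = toy.query("country == 'A' and year >= 1977 and lifeExp < 70")

# Membership test combined with a negated condition
not_1972 = toy[toy['country'].isin(['A', 'B']) & ~(toy['year'] == 1972)]

print(by_mask)
print(by_query)
print(not_1972)
```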

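Another way to filter data is to chain the filters one after another. This style is very popular, but each later mask is built from the full `df` and has to be realigned to the already filtered rows, which usually makes pandas emit a warning.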

In [None]:
# Filtering with automatic reindexing
df_mex = df[df.country == 'Mexico'][df.year >= 1977][df.lifeExp < 70]
df_mex
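
A small sketch of the issue on an invented toy frame: the chained form and the single combined mask select the same rows, but the chained form relies on pandas realigning the second mask, which typically triggers a `UserWarning`. The warnings are captured here only so they can be counted.

```python
import warnings
import pandas as pd

# Toy frame with invented placeholder values
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B'],
                    'year': [1972, 1982, 1972, 1982]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # Second mask has 4 entries but is applied to a 2-row frame
    chained = toy[toy.country == 'A'][toy.year >= 1977]

# One combined mask, applied once with .loc
combined = toy.loc[(toy.country == 'A') & (toy.year >= 1977)]

print('warnings raised by the chained form:', len(caught))
print(chained)
print(combined)
```

Both results keep the original row labels, which is why the notebook resets the index afterwards.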

In [None]:
# Reindexing data: discard the old index and number the rows from 0
df_mex = df_mex.reset_index(drop=True)
print(df_mex)

In [None]:
# Sort the data by one column
df_mex.sort_values('gdp')
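
`sort_values` returns a sorted copy and leaves the original untouched. It also accepts a sort direction and a list of columns. A sketch on an invented toy frame:

```python
import pandas as pd

# Toy frame with invented placeholder values
toy = pd.DataFrame({'country': ['B', 'A', 'B', 'A'],
                    'year': [1982, 1982, 1972, 1972],
                    'gdp': [3.0, 1.0, 2.0, 4.0]})

# Largest gdp first; the original row labels travel with the rows
by_gdp_desc = toy.sort_values('gdp', ascending=False)

# Sort by country, then by year, and renumber the rows from 0
by_country_year = toy.sort_values(['country', 'year']).reset_index(drop=True)

print(by_gdp_desc)
print(by_country_year)
```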

## Plotting in Pandas

In [None]:
# Histogram of all the numerical features
df.hist()

In [None]:
# Plot only the data for Mexico
df[df.country == 'Mexico'].plot(x='year',y='gdp')

In data exploration, it is highly recommended to draw scatter plots of the features. Pandas has a function for this.


In [None]:
# plot scatter matrix
pd.plotting.scatter_matrix(df)

In [None]:
# Plot the scatter matrix for Mexico only
pd.plotting.scatter_matrix(df[df.country == 'Mexico'])

## Write code to answer the following questions

In [None]:
print('How many and which countries have a life expectancy greater than or equal to 80 in 2002?')
paises =
print('Number of countries: ', len(paises))
print('Countries: ', paises)

In [None]:
print('Which country has the highest Gross Domestic Product in any year?')
country =
print('The country with the highest GDP: ', country)

In [None]:
print('In what year did Mexico exceed 70 million inhabitants?')
year =
print('In the year of: ',year)