# Data Analysis with Pandas
![](https://ecmetrics.com/es/wp-content/uploads/2017/08/big-data-para-el-online-community.jpg)

One of the most in-demand skills today is the ability to analyze data and extract results, trends, summaries, and approximations: in short, to take data and turn it into information. The pandas library lets us do this with concise, simple instructions on a DataFrame that can be as large as the system can handle (Excel, by comparison, currently has a limit of 1,048,576 rows by 16,384 columns, regardless of how much RAM is available).

Luis A. Muñoz

## Analyzing Spanish La Liga data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("dataset_la_liga.csv")

In [3]:
df.tail()

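| | season | club | home_win | away_win | home_loss | away_loss | matches_won | matches_lost | matches_drawn | total_matches | points | home_goals | away_goals | goals_scored | goals_conceded | goal_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 903 | 2016-17 | Villarreal | 11 | 8 | 4 | 5 | 19 | 9 | 10 | 38 | 67 | 35 | 21 | 56 | 33 | 23 |
| 904 | 2016-17 | Sevilla | 14 | 7 | 1 | 7 | 21 | 8 | 9 | 38 | 72 | 39 | 30 | 69 | 49 | 20 |
| 905 | 2016-17 | Atletico de Madrid | 14 | 9 | 3 | 3 | 23 | 6 | 9 | 38 | 78 | 40 | 30 | 70 | 27 | 43 |
| 906 | 2016-17 | Barcelona | 15 | 13 | 1 | 3 | 28 | 4 | 6 | 38 | 90 | 64 | 52 | 116 | 37 | 79 |
| 907 | 2016-17 | Real Madrid | 14 | 15 | 1 | 2 | 29 | 3 | 6 | 38 | 93 | 48 | 58 | 106 | 41 | 65 |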


The `describe()` method returns a DataFrame with the following information for each numeric column:

- Number of valid (non-NaN) values
- Mean
- Standard deviation
- Minimum and maximum values
- Percentiles (25%, 50%, and 75%)

In [4]:
df.describe()

| | home_win | away_win | home_loss | away_loss | matches_won | matches_lost | matches_drawn | total_matches | points | home_goals | away_goals | goals_scored | goals_conceded | goal_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 | 908.0 |
| mean | 9.582599 | 4.118943 | 4.118943 | 9.582599 | 13.701542 | 13.701542 | 9.577093 | 36.980176 | 50.681718 | 29.508811 | 18.267621 | 47.776432 | 47.776432 | 0.0 |
| std | 3.107555 | 2.776725 | 2.531206 | 3.092974 | 4.96605 | 4.589442 | 2.986852 | 2.52215 | 14.104626 | 9.757192 | 7.569011 | 15.593429 | 11.855507 | 21.680653 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 1.0 | 30.0 | 13.0 | 11.0 | 4.0 | 15.0 | 18.0 | -64.0 |
| 25% | 7.0 | 2.0 | 2.0 | 8.0 | 10.0 | 11.0 | 7.0 | 34.0 | 42.0 | 23.0 | 13.0 | 37.0 | 39.0 | -14.0 |
| 50% | 9.0 | 4.0 | 4.0 | 10.0 | 13.0 | 14.0 | 9.0 | 38.0 | 48.0 | 28.0 | 17.0 | 45.0 | 47.0 | -3.0 |
| 75% | 12.0 | 6.0 | 6.0 | 12.0 | 16.0 | 17.0 | 12.0 | 38.0 | 59.0 | 34.0 | 22.0 | 54.0 | 56.0 | 10.25 |
| max | 19.0 | 16.0 | 15.0 | 18.0 | 32.0 | 29.0 | 18.0 | 44.0 | 100.0 | 78.0 | 58.0 | 121.0 | 94.0 | 89.0 |
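
To make the meaning of the `count` row concrete, here is a minimal sketch on made-up data (the `toy` DataFrame and its values are illustrative, not taken from the La Liga dataset). It shows that `count` excludes NaN entries and that the `50%` row is the median of the valid values.

```python
import numpy as np
import pandas as pd

# Made-up toy data: "points" has one missing value, "goals" is complete.
toy = pd.DataFrame({
    "points": [10.0, 20.0, np.nan, 30.0],
    "goals": [1, 2, 3, 4],
})

summary = toy.describe()

# 'count' only counts non-NaN entries, so it differs between the columns.
print(summary.loc["count"])

# The '50%' row is the median of the valid values in each column.
print(summary.loc["50%"])
```

A `count` lower than the number of rows is therefore a first hint that a column has missing values.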


The `info()` method reports the columns and their data types. Because it also shows the non-null count for each column, it is a quick way to check whether there are NaN values.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 908 entries, 0 to 907
Data columns (total 16 columns):
season             908 non-null object
club               908 non-null object
home_win           908 non-null int64
away_win           908 non-null int64
home_loss          908 non-null int64
away_loss          908 non-null int64
matches_won        908 non-null int64
matches_lost       908 non-null int64
matches_drawn      908 non-null int64
total_matches      908 non-null int64
points             908 non-null int64
home_goals         908 non-null int64
away_goals         908 non-null int64
goals_scored       908 non-null int64
goals_conceded     908 non-null int64
goal_difference    908 non-null int64
dtypes: int64(14), object(2)
memory usage: 113.6+ KB
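
If a direct count of missing values is preferred over reading the non-null column of `info()`, one possible approach is `isna().sum()`. The sketch below uses a made-up `toy` DataFrame; the names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value in "points".
toy = pd.DataFrame({
    "club": ["A", "B", "C"],
    "points": [50.0, np.nan, 42.0],
})

# Number of NaN values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if there is at least one NaN anywhere in the DataFrame.
has_missing = bool(toy.isna().any().any())
print(has_missing)
```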


Using this DataFrame, we can answer some simple questions:

### Which team won La Liga in each season?

In [6]:
df.head()

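| | season | club | home_win | away_win | home_loss | away_loss | matches_won | matches_lost | matches_drawn | total_matches | points | home_goals | away_goals | goals_scored | goals_conceded | goal_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970-71 | Real Zaragoza | 3 | 0 | 5 | 13 | 3 | 18 | 9 | 30 | 18 | 14 | 8 | 22 | 54 | -32 |
| 1 | 1970-71 | Elche | 4 | 0 | 5 | 11 | 4 | 16 | 10 | 30 | 22 | 17 | 8 | 25 | 46 | -21 |
| 2 | 1970-71 | Las Palmas | 5 | 0 | 3 | 12 | 5 | 15 | 10 | 30 | 25 | 25 | 8 | 33 | 42 | -9 |
| 3 | 1970-71 | Sabadell | 8 | 0 | 3 | 14 | 8 | 17 | 5 | 30 | 29 | 19 | 9 | 28 | 49 | -21 |
| 4 | 1970-71 | Espanyol | 7 | 1 | 4 | 9 | 8 | 13 | 9 | 30 | 33 | 13 | 5 | 18 | 25 | -7 |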


In [7]:
df_campeones_por_temporada = df.sort_values('points', ascending=False).drop_duplicates(['season']).sort_values('season')[['season', 'club', 'points']]
df_campeones_por_temporada

| | season | club | points |
|---|---|---|---|
| 15 | 1970-71 | Barcelona | 62 |
| 33 | 1971-72 | Real Madrid | 66 |
| 51 | 1972-73 | Atletico de Madrid | 68 |
| 69 | 1973-74 | Barcelona | 71 |
| 87 | 1974-75 | Real Madrid | 70 |
| 105 | 1975-76 | Real Madrid | 68 |
| 123 | 1976-77 | Atletico de Madrid | 65 |
| 141 | 1977-78 | Real Madrid | 69 |
| 159 | 1978-79 | Real Madrid | 63 |
| 177 | 1979-80 | Real Madrid | 75 |

### Which team has won La Liga the most times?

In [8]:
df_campeones_por_temporada[['club', 'season']].groupby('club').count().sort_values('season', ascending=False)

| club | season |
|---|---|
| Real Madrid | 20 |
| Barcelona | 18 |
| Atletico de Madrid | 4 |
| Valencia | 2 |
| Athletic Club | 1 |
| Deportivo | 1 |
| Real Sociedad | 1 |
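
Counting how often each club appears can also be written with `value_counts()`, which already sorts in descending order. A minimal sketch on made-up data (the `champions` DataFrame here is hypothetical, not the real list of winners):

```python
import pandas as pd

# Made-up list of champions, one row per season.
champions = pd.DataFrame({
    "season": ["2000-01", "2001-02", "2002-03", "2003-04"],
    "club": ["A", "B", "A", "A"],
})

# Number of titles per club, most successful club first.
titles = champions["club"].value_counts()
print(titles)
```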


### Which team has finished last in La Liga the most times?

In [9]:
df_perdedores_por_temporada = df.sort_values('points').drop_duplicates(['season']).sort_values('season')[['season', 'club', 'points']]
df_perdedores_por_temporada[['club', 'season']].groupby('club').count().sort_values('season', ascending=False).iloc[0:5]

| club | season |
|---|---|
| Murcia | 3 |
| Real Zaragoza | 3 |
| Salamanca | 2 |
| Racing de Santander | 2 |
| Rayo Vallecano | 2 |


### In which three seasons were the most total points awarded in the tournament?

In [10]:
df[['season', 'points']].groupby(by='season').sum().sort_values('points', ascending=False).iloc[0:3]

| season | points |
|---|---|
| 1996-97 | 1267 |
| 1995-96 | 1256 |
| 1986-87 | 1087 |


### Which team has won the most home matches?

In [11]:
df[['club', 'home_win']].groupby(by='club').sum().sort_values('home_win', ascending=False).iloc[0:3]

| club | home_win |
|---|---|
| Real Madrid | 663 |
| Barcelona | 652 |
| Atletico de Madrid | 520 |
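
The "group, sum, sort, take the top rows" pattern used in these questions can be shortened with `nlargest()`. A sketch on made-up data (club names and win counts are illustrative):

```python
import pandas as pd

# Made-up home wins per club and season.
toy = pd.DataFrame({
    "club": ["A", "B", "C", "A", "B", "C", "D"],
    "home_win": [10, 12, 5, 11, 8, 6, 30],
})

# Total home wins per club, keeping only the three largest totals.
top3 = toy.groupby("club")["home_win"].sum().nlargest(3)
print(top3)
```

This returns a Series rather than a DataFrame; selecting `[["home_win"]]` instead of `["home_win"]` would require calling `nlargest(3, "home_win")`.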


### Which team has won the most away matches?

In [12]:
df[['club', 'away_win']].groupby(by='club').sum().sort_values('away_win', ascending=False).iloc[0:3]

| club | away_win |
|---|---|
| Real Madrid | 388 |
| Barcelona | 373 |
| Atletico de Madrid | 259 |


### If Sevilla is going to play Villarreal, which team should I bet on?

In [13]:
df[df['club'] == 'Sevilla'].describe()

| | home_win | away_win | home_loss | away_loss | matches_won | matches_lost | matches_drawn | total_matches | points | home_goals | away_goals | goals_scored | goals_conceded | goal_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 | 41.0 |
| mean | 10.439024 | 4.341463 | 3.658537 | 9.512195 | 14.780488 | 13.170732 | 9.02439 | 36.97561 | 53.365854 | 30.146341 | 19.097561 | 49.243902 | 45.439024 | 3.804878 |
| std | 2.335046 | 2.798301 | 2.092961 | 2.460914 | 3.636703 | 3.121718 | 2.361015 | 2.650357 | 9.971851 | 6.687903 | 6.992156 | 11.612021 | 7.861453 | 12.266254 |
| min | 5.0 | 0.0 | 0.0 | 4.0 | 5.0 | 7.0 | 4.0 | 30.0 | 28.0 | 17.0 | 6.0 | 29.0 | 31.0 | -25.0 |
| 25% | 9.0 | 2.0 | 2.0 | 8.0 | 13.0 | 11.0 | 7.0 | 34.0 | 47.0 | 26.0 | 13.0 | 41.0 | 40.0 | -4.0 |
| 50% | 11.0 | 4.0 | 4.0 | 10.0 | 14.0 | 13.0 | 9.0 | 38.0 | 51.0 | 29.0 | 18.0 | 48.0 | 45.0 | 2.0 |
| 75% | 12.0 | 7.0 | 5.0 | 11.0 | 17.0 | 14.0 | 11.0 | 38.0 | 60.0 | 35.0 | 25.0 | 56.0 | 49.0 | 14.0 |
| max | 15.0 | 10.0 | 9.0 | 14.0 | 23.0 | 23.0 | 15.0 | 44.0 | 76.0 | 46.0 | 33.0 | 75.0 | 69.0 | 29.0 |


In [14]:
df[df['club'] == 'Villarreal'].describe()

| | home_win | away_win | home_loss | away_loss | matches_won | matches_lost | matches_drawn | total_matches | points | home_goals | away_goals | goals_scored | goals_conceded | goal_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 | 17.0 |
| mean | 10.294118 | 5.352941 | 3.941176 | 8.294118 | 15.647059 | 12.235294 | 10.117647 | 38.0 | 57.058824 | 30.705882 | 21.764706 | 52.470588 | 46.411765 | 6.058824 |
| std | 2.3655 | 2.644361 | 1.477777 | 2.640187 | 4.030217 | 2.969205 | 2.471901 | 0.0 | 10.568057 | 4.753482 | 4.548755 | 8.147988 | 8.818013 | 13.36259 |
| min | 6.0 | 2.0 | 1.0 | 4.0 | 8.0 | 9.0 | 5.0 | 38.0 | 36.0 | 23.0 | 13.0 | 39.0 | 33.0 | -16.0 |
| 25% | 9.0 | 3.0 | 3.0 | 7.0 | 14.0 | 9.0 | 8.0 | 38.0 | 54.0 | 28.0 | 19.0 | 47.0 | 39.0 | -2.0 |
| 50% | 10.0 | 5.0 | 4.0 | 9.0 | 16.0 | 12.0 | 10.0 | 38.0 | 59.0 | 30.0 | 21.0 | 50.0 | 44.0 | 7.0 |
| 75% | 12.0 | 7.0 | 4.0 | 10.0 | 18.0 | 14.0 | 12.0 | 38.0 | 64.0 | 33.0 | 25.0 | 58.0 | 53.0 | 11.0 |
| max | 14.0 | 12.0 | 7.0 | 13.0 | 24.0 | 18.0 | 15.0 | 38.0 | 77.0 | 41.0 | 30.0 | 69.0 | 63.0 | 32.0 |
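
Reading two `describe()` tables side by side is tedious. In the outputs above, Villarreal averages about 57.06 points over 17 seasons and Sevilla about 53.37 over 41 seasons. One possible way to line up such averages in a single table is sketched below on made-up data (clubs `X`, `Y`, `Z` and their numbers are illustrative). Historical averages are only descriptive and are not a prediction of a single match.

```python
import pandas as pd

# Made-up season records for three clubs.
toy = pd.DataFrame({
    "club": ["X", "X", "Y", "Y", "Z"],
    "points": [50, 60, 55, 65, 40],
    "goal_difference": [0, 10, 5, 15, -20],
})

# Keep only the two clubs of interest.
pair = toy[toy["club"].isin(["X", "Y"])]

# Mean of each metric per club; transpose so clubs become columns.
comparison = pair.groupby("club")[["points", "goal_difference"]].mean().T
print(comparison)
```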


### Scatter plot of La Liga champions

In [None]:
df1 = df.sort_values('points', ascending=False).drop_duplicates(['season']).sort_values('season')[['season', 'club', 'points']]
df2 = df1[['club', 'season']].groupby('club').count().sort_values('season', ascending=False)
df3 = df1[['club', 'points']].groupby(by='club').mean()

df2.reset_index(inplace=True)
df3.reset_index(inplace=True)

#print(df2)
#print(df3)
df_result = df2.merge(df3).drop_duplicates('club').sort_values('season')
df_result
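
The cell above builds two grouped tables (title count and mean points) and merges them. With named aggregation, the same kind of table could be produced in one `groupby` call. This is a sketch on made-up data, assuming a champions table with `season`, `club`, and `points` columns as in `df1`:

```python
import pandas as pd

# Made-up champions table, one row per season.
champions = pd.DataFrame({
    "season": ["2000-01", "2001-02", "2002-03"],
    "club": ["A", "B", "A"],
    "points": [80, 90, 84],
})

# Count titles and average the winning points in a single pass.
result = (
    champions.groupby("club")
    .agg(season=("season", "count"), points=("points", "mean"))
    .reset_index()
    .sort_values("season")
)
print(result)
```

The output columns keep the names `season` (number of titles) and `points` (mean points), so they match the column names the scatter plot expects.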

In [None]:
import matplotlib.pyplot as plt
plt.style.use('default')

df_result.plot(kind='scatter', figsize=(8, 4), x='club', y='season', c='points', colormap='coolwarm', sharex=False)
plt.xticks(rotation=45)
plt.show()

## Analyzing COVID-19 data

In [None]:
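import requests

# Time series of COVID-19 counts per country, published as JSON
url = "https://pomber.github.io/covid19/timeseries.json"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
data = r.json()

# Each country name maps to a list of daily records
df = pd.DataFrame(data['Peru'])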

In [None]:
df.tail()

In [None]:
df.index = pd.DatetimeIndex(pd.to_datetime(df['date']))

In [None]:
df.drop(columns='date', inplace=True)
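
The two cells above turn the `date` column into a `DatetimeIndex` and then drop the column. The same result can be sketched offline with `set_index`. The records below are made up and only assume the column names used in this notebook (`date`, `confirmed`, `deaths`, `recovered`); the real date format in the feed may differ.

```python
import pandas as pd

# Made-up records with the same column names as the downloaded data.
records = [
    {"date": "2020-03-06", "confirmed": 1, "deaths": 0, "recovered": 0},
    {"date": "2020-03-07", "confirmed": 6, "deaths": 0, "recovered": 0},
    {"date": "2020-03-08", "confirmed": 7, "deaths": 0, "recovered": 1},
]

toy = pd.DataFrame(records)

# Parse the strings into timestamps and move them into the index.
toy["date"] = pd.to_datetime(toy["date"])
toy = toy.set_index("date")

# With a DatetimeIndex, rows can be sliced by date strings.
subset = toy.loc["2020-03-07":]
print(subset)
```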

In [None]:
df.tail()

### Computing daily case counts

In [None]:
df_diarios = df.rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])

In [None]:
df_diarios.tail()
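
The daily values above are obtained by subtracting each row from the next one with a rolling window of size 2. pandas also offers `DataFrame.diff()` for this first difference. The sketch below checks on made-up cumulative counts that both give the same numbers:

```python
import pandas as pd

# Made-up cumulative counts for four consecutive days.
cumulative = pd.DataFrame(
    {"confirmed": [1, 6, 7, 11], "deaths": [0, 0, 1, 1]},
    index=pd.date_range("2020-03-06", periods=4, freq="D"),
)

# First difference: today's cumulative value minus yesterday's.
daily = cumulative.diff()

# Same computation written as a rolling window of two rows.
via_rolling = cumulative.rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])

print(daily)
print(via_rolling)
```

In both cases the first row is NaN, because there is no previous day to subtract.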

In [None]:
df_diarios.plot(kind='bar', y=['confirmed', 'deaths', 'recovered'], figsize=(12, 12), subplots=True, layout=(3, 1))
plt.xticks([])
plt.show()

### What is the trend in confirmed cases?

In [None]:
# Moving Average 7 days
ma = df.rolling(window=7).mean()
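
A 7-row rolling mean leaves the first six rows as NaN, because a full window is not yet available. The sketch below illustrates this on a made-up series and shows the optional `min_periods` argument, which allows partial windows at the start:

```python
import pandas as pd

# Made-up series of eight consecutive values.
series = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype="float64")

# Full 7-value windows only: the first six results are NaN.
ma7 = series.rolling(window=7).mean()

# Allow shorter windows at the start, so no value is NaN.
ma7_partial = series.rolling(window=7, min_periods=1).mean()

print(ma7)
print(ma7_partial)
```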

In [None]:
df['confirmed'].plot()
ma['confirmed'].plot()
plt.legend(['Confirmed', 'Mov. Average 7'])
plt.grid()
plt.show()

In [None]:
df['recovered'].plot()
ma['recovered'].plot()
plt.legend(['Recovered', 'Mov. Average 7'])
plt.grid()
plt.show()

In [None]:
df['deaths'].plot()
ma['deaths'].plot()
plt.legend(['Deaths', 'Mov. Average 7'])
plt.grid()
plt.show()