# Pandas

Pandas is a Python library for manipulating and analyzing data.

The two basic structures that pandas uses to store information are the *DataFrame* and the *Series*.


## Loading the package

In [42]:
import pandas as pd

## *DataFrame*

A *DataFrame* stores data in rows and columns. Each row holds the data for one case, as in a relational database table or a spreadsheet.

In [43]:
employees = pd.DataFrame(
    {
        "nombre": ["Aitor Tilla", "Helen Chufe", "Lola Mento"],
        "salario": [2000, 2500, 1400],
        "sexo": ["male", "female", "female"]
    },
    index=["A", "B", "C"],
)

print(employees)

        nombre  salario    sexo
A  Aitor Tilla     2000    male
B  Helen Chufe     2500  female
C   Lola Mento     1400  female
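
As a complement to the row-and-column description above, the following minimal sketch inspects the structure of the same toy *DataFrame*. The variable `row_b` is introduced here only for illustration.

```python
import pandas as pd

# Same toy data as in the cell above.
employees = pd.DataFrame(
    {
        "nombre": ["Aitor Tilla", "Helen Chufe", "Lola Mento"],
        "salario": [2000, 2500, 1400],
        "sexo": ["male", "female", "female"],
    },
    index=["A", "B", "C"],
)

print(employees.shape)          # (number of rows, number of columns)
print(list(employees.columns))  # column labels
print(list(employees.index))    # row labels

# .loc selects by label: here, the whole row labeled "B" (returned as a Series).
row_b = employees.loc["B"]
print(row_b["nombre"])
```

Selecting by label with `.loc` is what makes a custom index such as `A`, `B`, `C` useful.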


## *Series*

A *Series* is a one-dimensional sequence of data, similar to a list, in which every value has an index label.

In [44]:
numbers = pd.Series([500, 200, 300], index=["A", "B", "C"])
print(numbers)

A    500
B    200
C    300
dtype: int64
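
The index is what distinguishes a *Series* from a plain list. The sketch below shows access by label and by position, and how arithmetic aligns values on labels. The second Series, `other`, is a made-up example.

```python
import pandas as pd

numbers = pd.Series([500, 200, 300], index=["A", "B", "C"])

by_label = numbers.loc["B"]     # access by index label
by_position = numbers.iloc[1]   # access by integer position

# Arithmetic between two Series matches values by index label, not by position.
other = pd.Series([1, 2, 3], index=["C", "B", "A"])
total = numbers + other
print(total)
```

Because the labels are matched before adding, `A` is combined with `A` even though the two Series list their labels in opposite order.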


A *Series* can also be given a name, which labels it in the same way that a column name labels a column of a *DataFrame*.

In [45]:
extra_salary = pd.Series([500, 200, 300, 400, 850, 600], name="extra salary")
print(extra_salary)

0    500
1    200
2    300
3    400
4    850
5    600
Name: extra salary, dtype: int64


Each column of a *DataFrame* is a *Series*. For example, to work only with the salaries, we select the `salario` column.

In [46]:
print(employees["salario"])

A    2000
B    2500
C    1400
Name: salario, dtype: int64


In [47]:
print(f"The maximum salary is {employees['salario'].max()} € and the minimum is {employees['salario'].min()} €")

The maximum salary is 2500 € and the minimum is 1400 €


We can also obtain summary statistics.

In [48]:
employees.describe()

Unnamed: 0,salario
count,3.0
mean,1966.666667
std,550.757055
min,1400.0
25%,1700.0
50%,2000.0
75%,2250.0
max,2500.0
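
The output above contains only `salario` because `describe()` summarizes numeric columns by default. As a small sketch on the same toy data, passing `include="all"` also summarizes the text columns. The names `numeric_only` and `everything` are illustrative.

```python
import pandas as pd

employees = pd.DataFrame(
    {
        "nombre": ["Aitor Tilla", "Helen Chufe", "Lola Mento"],
        "salario": [2000, 2500, 1400],
        "sexo": ["male", "female", "female"],
    },
    index=["A", "B", "C"],
)

numeric_only = employees.describe()              # default: numeric columns only
everything = employees.describe(include="all")   # also count/unique/top/freq for text columns

print(list(numeric_only.columns))
print(everything.loc[["count", "unique", "top", "freq"], ["nombre", "sexo"]])
```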


## Loading data from a CSV

Data usually comes from an external source, such as a CSV file.

We have uploaded the file `employees.csv` to the `sample_data` folder of our Colab environment.

In [49]:
employees = pd.read_csv("sample_data/employees.csv")

In [50]:
print(employees)

    EMPLOYEE_ID   FIRST_NAME    LAST_NAME     EMAIL  PHONE_NUMBER  HIRE_DATE  \
0           198       Donald     OConnell  DOCONNEL  650.507.9833  21-JUN-07   
1           199      Douglas        Grant    DGRANT  650.507.9844  13-JAN-08   
2           200     Jennifer       Whalen   JWHALEN  515.123.4444  17-SEP-03   
3           201      Michael    Hartstein  MHARTSTE  515.123.5555  17-FEB-04   
4           202          Pat          Fay      PFAY  603.123.6666  17-AUG-05   
5           203        Susan       Mavris   SMAVRIS  515.123.7777  07-JUN-02   
6           204      Hermann         Baer     HBAER  515.123.8888  07-JUN-02   
7           205      Shelley      Higgins  SHIGGINS  515.123.8080  07-JUN-02   
8           206      William        Gietz    WGIETZ  515.123.8181  07-JUN-02   
9           100       Steven         King     SKING  515.123.4567  17-JUN-03   
10          101        Neena      Kochhar  NKOCHHAR  515.123.4568  21-SEP-05   
11          102          Lex      De Haa

In [51]:
employees.describe()

Unnamed: 0,EMPLOYEE_ID,SALARY,DEPARTMENT_ID
count,50.0,50.0,50.0
mean,134.76,6182.32,57.6
std,33.631594,4586.181772,25.11687
min,100.0,2100.0,10.0
25%,112.25,2725.0,50.0
50%,124.5,4600.0,50.0
75%,136.75,8150.0,60.0
max,206.0,24000.0,110.0


As with the Linux `head` and `tail` commands, we can display the first and last rows of the *DataFrame*. Both methods show five rows by default and accept the number of rows as an argument.

In [52]:
employees.head()

Unnamed: 0,EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
0,198,Donald,OConnell,DOCONNEL,650.507.9833,21-JUN-07,SH_CLERK,2600,-,124,50
1,199,Douglas,Grant,DGRANT,650.507.9844,13-JAN-08,SH_CLERK,2600,-,124,50
2,200,Jennifer,Whalen,JWHALEN,515.123.4444,17-SEP-03,AD_ASST,4400,-,101,10
3,201,Michael,Hartstein,MHARTSTE,515.123.5555,17-FEB-04,MK_MAN,13000,-,100,20
4,202,Pat,Fay,PFAY,603.123.6666,17-AUG-05,MK_REP,6000,-,201,20


In [53]:
employees.head(30)

Unnamed: 0,EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
0,198,Donald,OConnell,DOCONNEL,650.507.9833,21-JUN-07,SH_CLERK,2600,-,124,50
1,199,Douglas,Grant,DGRANT,650.507.9844,13-JAN-08,SH_CLERK,2600,-,124,50
2,200,Jennifer,Whalen,JWHALEN,515.123.4444,17-SEP-03,AD_ASST,4400,-,101,10
3,201,Michael,Hartstein,MHARTSTE,515.123.5555,17-FEB-04,MK_MAN,13000,-,100,20
4,202,Pat,Fay,PFAY,603.123.6666,17-AUG-05,MK_REP,6000,-,201,20
5,203,Susan,Mavris,SMAVRIS,515.123.7777,07-JUN-02,HR_REP,6500,-,101,40
6,204,Hermann,Baer,HBAER,515.123.8888,07-JUN-02,PR_REP,10000,-,101,70
7,205,Shelley,Higgins,SHIGGINS,515.123.8080,07-JUN-02,AC_MGR,12008,-,101,110
8,206,William,Gietz,WGIETZ,515.123.8181,07-JUN-02,AC_ACCOUNT,8300,-,205,110
9,100,Steven,King,SKING,515.123.4567,17-JUN-03,AD_PRES,24000,-,-,90


In [54]:
employees.tail()

Unnamed: 0,EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
45,136,Hazel,Philtanker,HPHILTAN,650.127.1634,06-FEB-08,ST_CLERK,2200,-,122,50
46,137,Renske,Ladwig,RLADWIG,650.121.1234,14-JUL-03,ST_CLERK,3600,-,123,50
47,138,Stephen,Stiles,SSTILES,650.121.2034,26-OCT-05,ST_CLERK,3200,-,123,50
48,139,John,Seo,JSEO,650.121.2019,12-FEB-06,ST_CLERK,2700,-,123,50
49,140,Joshua,Patel,JPATEL,650.121.1834,06-APR-06,ST_CLERK,2500,-,123,50


In [55]:
employees.tail(10)

Unnamed: 0,EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
40,131,James,Marlow,JAMRLOW,650.124.7234,16-FEB-05,ST_CLERK,2500,-,121,50
41,132,TJ,Olson,TJOLSON,650.124.8234,10-APR-07,ST_CLERK,2100,-,121,50
42,133,Jason,Mallin,JMALLIN,650.127.1934,14-JUN-04,ST_CLERK,3300,-,122,50
43,134,Michael,Rogers,MROGERS,650.127.1834,26-AUG-06,ST_CLERK,2900,-,122,50
44,135,Ki,Gee,KGEE,650.127.1734,12-DEC-07,ST_CLERK,2400,-,122,50
45,136,Hazel,Philtanker,HPHILTAN,650.127.1634,06-FEB-08,ST_CLERK,2200,-,122,50
46,137,Renske,Ladwig,RLADWIG,650.121.1234,14-JUL-03,ST_CLERK,3600,-,123,50
47,138,Stephen,Stiles,SSTILES,650.121.2034,26-OCT-05,ST_CLERK,3200,-,123,50
48,139,John,Seo,JSEO,650.121.2019,12-FEB-06,ST_CLERK,2700,-,123,50
49,140,Joshua,Patel,JPATEL,650.121.1834,06-APR-06,ST_CLERK,2500,-,123,50


## Saving the data in another format

A *DataFrame* can be saved in several formats, for example as a spreadsheet or as JSON.

In [56]:
employees.to_excel("sample_data/employees.xlsx")
employees.to_json("sample_data/employees.json")
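
One detail worth knowing when saving: `to_csv` and `to_excel` also write the index unless `index=False` is passed. The sketch below uses `to_csv` on a small made-up frame, writing to a string instead of a file so that it runs anywhere. An unnamed index column like this is a likely origin of columns called `Unnamed: 0` in CSV files, although that is an assumption about any particular file.

```python
import io
import pandas as pd

df = pd.DataFrame({"FIRST_NAME": ["Donald", "Douglas"], "SALARY": [2600, 2600]})

with_index = df.to_csv()                 # index written as an unnamed first column
without_index = df.to_csv(index=False)   # only the data columns

# Reading both texts back shows the difference in the resulting columns.
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```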

## Data type information


In [57]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   EMPLOYEE_ID     50 non-null     int64 
 1   FIRST_NAME      50 non-null     object
 2   LAST_NAME       50 non-null     object
 3   EMAIL           50 non-null     object
 4   PHONE_NUMBER    50 non-null     object
 5   HIRE_DATE       50 non-null     object
 6   JOB_ID          50 non-null     object
 7   SALARY          50 non-null     int64 
 8   COMMISSION_PCT  50 non-null     object
 9   MANAGER_ID      50 non-null     object
 10  DEPARTMENT_ID   50 non-null     int64 
dtypes: int64(3), object(8)
memory usage: 4.4+ KB
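
In the output above, `COMMISSION_PCT` and `MANAGER_ID` are reported as `object` with no nulls, because the file marks missing values with `-`. A possible way to handle this, sketched here on a few rows copied from the `head()` output (a subset of the columns, embedded as text so the example is self-contained), is the `na_values` argument of `read_csv`.

```python
import io
import pandas as pd

# A few rows in the same layout as employees.csv (subset of columns, for illustration).
csv_text = """EMPLOYEE_ID,FIRST_NAME,SALARY,COMMISSION_PCT,MANAGER_ID
198,Donald,2600,-,124
200,Jennifer,4400,-,101
100,Steven,24000,-,-
"""

raw = pd.read_csv(io.StringIO(csv_text))                    # "-" is kept as text
clean = pd.read_csv(io.StringIO(csv_text), na_values="-")   # "-" is read as NaN

print(raw.dtypes)
print(clean.dtypes)
print(clean["MANAGER_ID"].isna().sum())  # number of missing manager ids
```

With the placeholder treated as missing, the affected columns become numeric and the non-null counts reported by `info()` reflect the real gaps.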


## Selecting subsets of data

We will use a Kaggle dataset on coffee quality:
https://www.kaggle.com/datasets/volpatto/coffee-quality-database-from-cqi

In [58]:
coffe_df = pd.read_csv("/content/merged_data_cleaned.csv")

In [59]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [60]:
coffe_df.head()

Unnamed: 0.1,Unnamed: 0,Species,Owner,Country.of.Origin,Farm.Name,Lot.Number,Mill,ICO.Number,Company,Altitude,...,Color,Category.Two.Defects,Expiration,Certification.Body,Certification.Address,Certification.Contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
0,0,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,...,Green,0,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
1,1,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,...,Green,1,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
2,2,Arabica,grounds for health admin,Guatemala,"san marcos barrancas ""san cristobal cuch",,,,,1600 - 1800 m,...,,0,"May 31st, 2011",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1600.0,1800.0,1700.0
3,3,Arabica,yidnekachew dabessa,Ethiopia,yidnekachew dabessa coffee plantation,,wolensu,,yidnekachew debessa coffee plantation,1800-2200,...,Green,2,"March 25th, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1800.0,2200.0,2000.0
4,4,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,...,Green,2,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0


In [61]:
coffe_df.info()
coffe_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1339 entries, 0 to 1338
Data columns (total 44 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             1339 non-null   int64  
 1   Species                1339 non-null   object 
 2   Owner                  1332 non-null   object 
 3   Country.of.Origin      1338 non-null   object 
 4   Farm.Name              980 non-null    object 
 5   Lot.Number             276 non-null    object 
 6   Mill                   1021 non-null   object 
 7   ICO.Number             1180 non-null   object 
 8   Company                1130 non-null   object 
 9   Altitude               1113 non-null   object 
 10  Region                 1280 non-null   object 
 11  Producer               1107 non-null   object 
 12  Number.of.Bags         1339 non-null   int64  
 13  Bag.Weight             1339 non-null   object 
 14  In.Country.Partner     1339 non-null   object 
 15  Harv

Unnamed: 0.1,Unnamed: 0,Number.of.Bags,Aroma,Flavor,Aftertaste,Acidity,Body,Balance,Uniformity,Clean.Cup,Sweetness,Cupper.Points,Total.Cup.Points,Moisture,Category.One.Defects,Quakers,Category.Two.Defects,altitude_low_meters,altitude_high_meters,altitude_mean_meters
count,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1339.0,1338.0,1339.0,1109.0,1109.0,1109.0
mean,669.0,154.182972,7.566706,7.520426,7.401083,7.535706,7.517498,7.518013,9.834877,9.835108,9.856692,7.503376,82.089851,0.088379,0.479462,0.173393,3.556385,1750.713315,1799.347775,1775.030545
std,386.680316,129.987162,0.37756,0.398442,0.404463,0.379827,0.370064,0.408943,0.554591,0.763946,0.616102,0.473464,3.500575,0.048287,2.549683,0.832121,5.312541,8669.440545,8668.805771,8668.62608
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,334.5,14.0,7.42,7.33,7.25,7.33,7.33,7.33,10.0,10.0,10.0,7.25,81.08,0.09,0.0,0.0,0.0,1100.0,1100.0,1100.0
50%,669.0,175.0,7.58,7.58,7.42,7.58,7.5,7.5,10.0,10.0,10.0,7.5,82.5,0.11,0.0,0.0,2.0,1310.64,1350.0,1310.64
75%,1003.5,275.0,7.75,7.75,7.58,7.75,7.67,7.75,10.0,10.0,10.0,7.75,83.67,0.12,0.0,0.0,4.0,1600.0,1650.0,1600.0
max,1338.0,1062.0,8.75,8.83,8.67,8.75,8.58,8.75,10.0,10.0,10.0,10.0,90.58,0.28,63.0,11.0,55.0,190164.0,190164.0,190164.0


Let's extract a subset of the data containing the following columns: `Country.of.Origin`, `Region`, `Variety`, `Aroma`, `Flavor`, `Body`, `Sweetness`, and `Total.Cup.Points`.

In [62]:
coffe_subset = coffe_df[["Country.of.Origin", "Region", "Variety", "Aroma", "Flavor", "Body", "Sweetness", "Total.Cup.Points"]]

print(coffe_subset)

     Country.of.Origin                         Region  Variety  Aroma  Flavor  \
0             Ethiopia                   guji-hambela      NaN   8.67    8.83   
1             Ethiopia                   guji-hambela    Other   8.75    8.67   
2            Guatemala                            NaN  Bourbon   8.42    8.50   
3             Ethiopia                         oromia      NaN   8.17    8.58   
4             Ethiopia                   guji-hambela    Other   8.25    8.50   
...                ...                            ...      ...    ...     ...   
1334           Ecuador               san juan, playas      NaN   7.75    7.58   
1335           Ecuador               san juan, playas      NaN   7.50    7.67   
1336     United States  kwanza norte province, angola      NaN   7.33    7.33   
1337             India                            NaN      NaN   7.42    6.83   
1338           Vietnam                            NaN      NaN   6.75    6.67   

      Body  Sweetness  Tota

In [63]:
coffe_subset

Unnamed: 0,Country.of.Origin,Region,Variety,Aroma,Flavor,Body,Sweetness,Total.Cup.Points
0,Ethiopia,guji-hambela,,8.67,8.83,8.50,10.00,90.58
1,Ethiopia,guji-hambela,Other,8.75,8.67,8.42,10.00,89.92
2,Guatemala,,Bourbon,8.42,8.50,8.33,10.00,89.75
3,Ethiopia,oromia,,8.17,8.58,8.50,10.00,89.00
4,Ethiopia,guji-hambela,Other,8.25,8.50,8.42,10.00,88.83
...,...,...,...,...,...,...,...,...
1334,Ecuador,"san juan, playas",,7.75,7.58,5.08,7.75,78.75
1335,Ecuador,"san juan, playas",,7.50,7.67,5.17,8.42,78.08
1336,United States,"kwanza norte province, angola",,7.33,7.33,7.50,7.42,77.17
1337,India,,,7.42,6.83,7.25,7.08,75.08
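
The double brackets used in the selection above matter. As a minimal sketch on three rows taken from the output (the frame `coffee` is a reduced stand-in for `coffe_df`), a single column name returns a *Series*, while a list of names returns a *DataFrame*.

```python
import pandas as pd

coffee = pd.DataFrame(
    {
        "Country.of.Origin": ["Ethiopia", "Ethiopia", "Guatemala"],
        "Aroma": [8.67, 8.75, 8.42],
        "Total.Cup.Points": [90.58, 89.92, 89.75],
    }
)

one_column = coffee["Aroma"]            # single name -> Series
one_column_frame = coffee[["Aroma"]]    # list with one name -> DataFrame with one column
two_columns = coffee[["Aroma", "Total.Cup.Points"]]

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```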


## Filtering data


In [64]:
# All coffees from Ecuador

coffe_subset[coffe_subset['Country.of.Origin'] == 'Ecuador']

# Note that the inner expression on its own,
# coffe_subset['Country.of.Origin'] == 'Ecuador'
# gives a different result.

# It does not return a filtered DataFrame but a Series of booleans:
# the condition is evaluated for every row of the 'Country.of.Origin' column,
# giving True where the value is 'Ecuador' and False otherwise.

Unnamed: 0,Country.of.Origin,Region,Variety,Aroma,Flavor,Body,Sweetness,Total.Cup.Points
286,Ecuador,"province of manabi, ecuador",,7.5,7.67,7.83,10.0,83.83
1334,Ecuador,"san juan, playas",,7.75,7.58,5.08,7.75,78.75
1335,Ecuador,"san juan, playas",,7.5,7.67,5.17,8.42,78.08
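
The two steps described in the comments of the cell above can be separated explicitly. The sketch below uses a four-row stand-in frame called `coffee`, with values taken from the outputs in this notebook, first building the boolean mask and then using it to filter.

```python
import pandas as pd

coffee = pd.DataFrame(
    {
        "Country.of.Origin": ["Ethiopia", "Ecuador", "Guatemala", "Ecuador"],
        "Total.Cup.Points": [90.58, 83.83, 89.75, 78.75],
    }
)

# Step 1: a boolean Series, one value per row.
mask = coffee["Country.of.Origin"] == "Ecuador"
print(mask.tolist())
print(mask.sum())  # True counts as 1, so this is the number of matching rows

# Step 2: indexing with the mask keeps only the rows where it is True.
ecuador = coffee[mask]
print(ecuador)
```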


In [65]:
# Coffees with more than 85 points

coffe_subset[coffe_subset['Total.Cup.Points'] > 85]


Unnamed: 0,Country.of.Origin,Region,Variety,Aroma,Flavor,Body,Sweetness,Total.Cup.Points
0,Ethiopia,guji-hambela,,8.67,8.83,8.50,10.0,90.58
1,Ethiopia,guji-hambela,Other,8.75,8.67,8.42,10.0,89.92
2,Guatemala,,Bourbon,8.42,8.50,8.33,10.0,89.75
3,Ethiopia,oromia,,8.17,8.58,8.50,10.0,89.00
4,Ethiopia,guji-hambela,Other,8.25,8.50,8.42,10.0,88.83
...,...,...,...,...,...,...,...,...
91,United States (Hawaii),kona,Hawaiian Kona,7.58,7.83,7.83,10.0,85.08
92,Colombia,cundinamarca,Caturra,7.83,8.00,7.83,10.0,85.08
93,United States (Hawaii),kona,,8.08,8.17,8.08,10.0,85.08
94,Ethiopia,sidamo,,7.67,8.00,7.92,10.0,85.08
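
Several conditions can be combined in one filter. The sketch below uses a small illustrative frame (the values are chosen for the example and are not taken from the dataset). Masks are combined with `&` (and), `|` (or) and `~` (not), and each comparison needs its own parentheses when written inline.

```python
import pandas as pd

# Illustrative values only.
coffee = pd.DataFrame(
    {
        "Country.of.Origin": ["Ethiopia", "Ethiopia", "Guatemala", "Mexico"],
        "Total.Cup.Points": [90.58, 84.00, 89.75, 71.08],
    }
)

high_score = coffee["Total.Cup.Points"] > 85
from_ethiopia = coffee["Country.of.Origin"] == "Ethiopia"

both = coffee[high_score & from_ethiopia]        # both conditions hold
either = coffee[high_score | from_ethiopia]      # at least one holds
neither = coffee[~(high_score | from_ethiopia)]  # none holds

# Inline form: the parentheses around each comparison are required.
both_inline = coffee[(coffee["Total.Cup.Points"] > 85) & (coffee["Country.of.Origin"] == "Ethiopia")]
```

Python's `and` and `or` keywords do not work element by element on a *Series*, which is why the operators `&` and `|` are used instead.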


In [66]:
# All coffees from Colombia, Guatemala, or Mexico

coffe_subset[coffe_subset['Country.of.Origin'].isin(['Colombia', 'Guatemala', 'Mexico'])]


Unnamed: 0,Country.of.Origin,Region,Variety,Aroma,Flavor,Body,Sweetness,Total.Cup.Points
2,Guatemala,,Bourbon,8.42,8.50,8.33,10.00,89.75
22,Mexico,xalapa,Other,8.17,8.25,7.83,10.00,87.17
47,Colombia,tolima,Caturra,7.75,7.92,8.08,10.00,86.00
52,Guatemala,nuevo oriente,Bourbon,7.92,8.08,8.08,10.00,85.92
54,Colombia,huila,Caturra,8.00,8.00,7.83,10.00,85.92
...,...,...,...,...,...,...,...,...
1299,Mexico,tlatlauquitepec,Mundo Novo,6.92,6.92,7.17,10.00,71.08
1300,Mexico,veracruz,Bourbon,6.50,6.67,7.33,10.00,71.00
1301,Mexico,"sierra norte yajalon, chiapas",Typica,6.92,7.00,7.42,10.00,70.75
1306,Mexico,juchique de ferrer,Bourbon,7.08,6.83,7.25,10.00,68.33


In [67]:
# All coffees whose region is known

coffe_subset[coffe_subset['Region'].notna()]

Unnamed: 0,Country.of.Origin,Region,Variety,Aroma,Flavor,Body,Sweetness,Total.Cup.Points
0,Ethiopia,guji-hambela,,8.67,8.83,8.50,10.00,90.58
1,Ethiopia,guji-hambela,Other,8.75,8.67,8.42,10.00,89.92
3,Ethiopia,oromia,,8.17,8.58,8.50,10.00,89.00
4,Ethiopia,guji-hambela,Other,8.25,8.50,8.42,10.00,88.83
7,Ethiopia,oromia,,8.25,8.33,8.33,9.33,88.67
...,...,...,...,...,...,...,...,...
1332,India,chikmagalur,,7.58,7.42,7.42,7.42,80.17
1333,United States,chikmagalur,Arusha,7.92,7.50,7.42,7.58,79.33
1334,Ecuador,"san juan, playas",,7.75,7.58,5.08,7.75,78.75
1335,Ecuador,"san juan, playas",,7.50,7.67,5.17,8.42,78.08


In [68]:
# Variety and aroma of the most aromatic coffees (aroma score above 8 points)

coffe_subset.loc[coffe_subset['Aroma'] > 8, ["Variety", "Aroma"]]

Unnamed: 0,Variety,Aroma
0,,8.67
1,Other,8.75
2,Bourbon,8.42
3,,8.17
4,Other,8.25
...,...,...
540,Other,8.08
601,Typica,8.08
631,,8.08
712,Bourbon,8.25


In [69]:
# Same as above, excluding rows where the variety is missing or is "Other".

coffe_subset_clean = coffe_subset[coffe_subset['Variety'].notna()]
coffe_subset_clean = coffe_subset_clean[coffe_subset_clean['Variety'] != 'Other']
coffe_subset_clean.loc[coffe_subset_clean['Aroma'] > 8, ["Variety", "Aroma"]]

Unnamed: 0,Variety,Aroma
2,Bourbon,8.42
18,Catimor,8.42
19,Ethiopian Yirgacheffe,8.17
21,Caturra,8.08
25,Bourbon,8.5
27,SL14,8.42
28,Caturra,8.17
32,Bourbon,8.5
33,Caturra,8.17
35,SL34,8.08
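
The three filtering steps in the cell above could also be written as a single mask. This is only an alternative sketch, shown on a small stand-in frame whose last row is invented for illustration.

```python
import pandas as pd

# Stand-in data; the last row is invented for illustration.
coffee = pd.DataFrame(
    {
        "Variety": [None, "Other", "Bourbon", "Catimor", "Caturra"],
        "Aroma": [8.67, 8.75, 8.42, 8.42, 7.50],
    }
)

mask = (
    coffee["Variety"].notna()          # variety is known
    & (coffee["Variety"] != "Other")   # and it is not the catch-all "Other"
    & (coffee["Aroma"] > 8)            # and the aroma score is above 8
)
result = coffee.loc[mask, ["Variety", "Aroma"]]
print(result)
```

Building one mask avoids the intermediate `coffe_subset_clean` frame, while the step-by-step version can be easier to read and debug.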


In [70]:
# All distinct coffee varieties (returned as a NumPy array)

coffe_df.Variety.unique()

array([nan, 'Other', 'Bourbon', 'Catimor', 'Ethiopian Yirgacheffe',
       'Caturra', 'SL14', 'Sumatra', 'SL34', 'Hawaiian Kona',
       'Yellow Bourbon', 'SL28', 'Gesha', 'Catuai', 'Pacamara', 'Typica',
       'Sumatra Lintong', 'Mundo Novo', 'Java', 'Peaberry', 'Pacas',
       'Mandheling', 'Ruiru 11', 'Arusha', 'Ethiopian Heirlooms',
       'Moka Peaberry', 'Sulawesi', 'Blue Mountain', 'Marigojipe',
       'Pache Comun'], dtype=object)
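
Note that `unique()` includes the missing value in its result, as the `nan` at the start of the array shows. The following sketch, on a made-up Series, compares it with `nunique()` and `value_counts()`, which ignore missing values by default.

```python
import pandas as pd

# Made-up sample with one missing value.
varieties = pd.Series([None, "Other", "Bourbon", "Bourbon", "Caturra"])

distinct = varieties.unique()             # distinct values, missing value included
n_distinct = varieties.nunique()          # number of distinct values, missing excluded
counts = varieties.value_counts()         # rows per variety, missing excluded
counts_all = varieties.value_counts(dropna=False)  # same, but counting missing too

print(len(distinct), n_distinct)
print(counts)
```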

## Summary statistics

In [71]:
# Mean score of the coffees from Ethiopia

coffe_subset.loc[coffe_subset['Country.of.Origin'] == "Ethiopia", "Total.Cup.Points"].mean()

85.48409090909091

Summary statistics can be viewed with `describe()`, as we saw at the beginning.

In [72]:
coffe_subset.describe()

Unnamed: 0,Aroma,Flavor,Body,Sweetness,Total.Cup.Points
count,1339.0,1339.0,1339.0,1339.0,1339.0
mean,7.566706,7.520426,7.517498,9.856692,82.089851
std,0.37756,0.398442,0.370064,0.616102,3.500575
min,0.0,0.0,0.0,0.0,0.0
25%,7.42,7.33,7.33,10.0,81.08
50%,7.58,7.58,7.5,10.0,82.5
75%,7.75,7.75,7.67,10.0,83.67
max,8.75,8.83,8.58,10.0,90.58


In [73]:
# Mean sweetness (Sweetness) by country

coffe_subset[["Country.of.Origin", "Sweetness"]].groupby("Country.of.Origin").mean()

Unnamed: 0_level_0,Sweetness
Country.of.Origin,Unnamed: 1_level_1
Brazil,9.949394
Burundi,10.0
China,9.91625
Colombia,9.952678
Costa Rica,9.908431
Cote d?Ivoire,10.0
Ecuador,8.723333
El Salvador,9.808571
Ethiopia,9.863409
Guatemala,9.870884


In [74]:
# How many coffees have been recorded for each country?

coffe_subset.groupby("Country.of.Origin")["Country.of.Origin"].count()

Unnamed: 0_level_0,Country.of.Origin
Country.of.Origin,Unnamed: 1_level_1
Brazil,132
Burundi,2
China,16
Colombia,183
Costa Rica,51
Cote d?Ivoire,1
Ecuador,3
El Salvador,21
Ethiopia,44
Guatemala,181


## Sorting

In [75]:
coffe_df.sort_values(by="Country.of.Origin")

Unnamed: 0.1,Unnamed: 0,Species,Owner,Country.of.Origin,Farm.Name,Lot.Number,Mill,ICO.Number,Company,Altitude,...,Color,Category.Two.Defects,Expiration,Certification.Body,Certification.Address,Certification.Contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
1075,1075,Arabica,jacques pereira carneiro,Brazil,pereira estate coffee,,carapina armazens gerais,002/135-2/0182,exportadora de cafés carmo de minas ltda,1250,...,,10,"April 17th, 2015",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1250.0,1250.0,1250.0
730,730,Arabica,nucoffee,Brazil,sitío santa luzia,,,002/1251/0243,nucoffee,1100m,...,Green,2,"April 11th, 2013",NUCOFFEE,567f200bcc17a90070cb952647bf88141ad9c80c,aa2ff513ffb9c844462a1fb07c599bce7f3bb53d,m,1100.0,1100.0,1100.0
483,483,Arabica,bourbon specialty coffees,Brazil,,,,002/4542/0477,bourbon specialty coffees,,...,Green,13,"April 19th, 2016",Brazil Specialty Coffee Association,3297cfa4c538e3dd03f72cc4082c54f7999e1f9d,8900f0bf1d0b2bafe6807a73562c7677d57eb980,m,,,
987,987,Arabica,gregorio sebba,Brazil,fazenda são josé mirante,14,garca armazens,,garca armazens,695,...,Green,8,"June 21st, 2018",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,695.0,695.0,695.0
484,484,Arabica,ipanema coffees,Brazil,rio verde,,ipanema coffees,002/1660/0064,ipanema coffees,1200,...,Green,1,"October 16th, 2015",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1200.0,1200.0,1200.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
554,554,Arabica,royal base corporation,Vietnam,"apollo co., ltd.",,"apollo co., ltd.",,royal base corporation,1040m,...,,17,"July 20th, 2013",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1040.0,1040.0,1040.0
564,564,Arabica,"sunvirtue co., ltd.",Vietnam,apollo estate,,apollo estate,,"sunvirtue co., ltd.",1040,...,Green,0,"January 18th, 2017",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1040.0,1040.0,1040.0
444,444,Arabica,"sunvirtue co., ltd.",Vietnam,apollo estate,Oriental Paris Natural Coffee,yes,,"sunvirtue co., ltd.",1550,...,,0,"May 8th, 2018",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1550.0,1550.0,1550.0
823,823,Arabica,lusso lab,Zambia,mubuyu munali,,,-,lusso coffee lab,1000-1500m,...,,3,"June 20th, 2015",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1000.0,1500.0,1250.0


In [76]:
# Coffees sorted by score (highest first)

coffe_subset.sort_values(by="Total.Cup.Points", ascending=False)

Unnamed: 0,Country.of.Origin,Region,Variety,Aroma,Flavor,Body,Sweetness,Total.Cup.Points
0,Ethiopia,guji-hambela,,8.67,8.83,8.50,10.00,90.58
1,Ethiopia,guji-hambela,Other,8.75,8.67,8.42,10.00,89.92
2,Guatemala,,Bourbon,8.42,8.50,8.33,10.00,89.75
3,Ethiopia,oromia,,8.17,8.58,8.50,10.00,89.00
4,Ethiopia,guji-hambela,Other,8.25,8.50,8.42,10.00,88.83
...,...,...,...,...,...,...,...,...
1306,Mexico,juchique de ferrer,Bourbon,7.08,6.83,7.25,10.00,68.33
1307,Haiti,"department d'artibonite , haiti",Typica,6.75,6.58,7.08,6.00,67.92
1308,Nicaragua,jalapa,Caturra,7.25,6.58,6.42,6.00,63.08
1309,Guatemala,nuevo oriente,Catuai,7.50,6.67,7.33,1.33,59.83
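
When only the top or bottom rows are needed, sorting followed by `head()` can be replaced by `nlargest()` or `nsmallest()`. A minimal sketch on four rows with scores taken from the outputs above:

```python
import pandas as pd

coffee = pd.DataFrame(
    {
        "Country.of.Origin": ["Guatemala", "Ethiopia", "Mexico", "Ethiopia"],
        "Total.Cup.Points": [89.75, 90.58, 68.33, 89.92],
    }
)

top_sorted = coffee.sort_values(by="Total.Cup.Points", ascending=False).head(2)
top_direct = coffee.nlargest(2, "Total.Cup.Points")   # same two rows, without a full sort call
bottom = coffee.nsmallest(1, "Total.Cup.Points")      # the lowest-scoring row

print(top_direct)
print(bottom)
```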


In [77]:
# Coffees sorted by country (alphabetically), region (alphabetically), and score (highest first)

coffe_subset.sort_values(by=["Country.of.Origin", "Region", "Total.Cup.Points"], ascending=[True, True, False])

Unnamed: 0,Country.of.Origin,Region,Variety,Aroma,Flavor,Body,Sweetness,Total.Cup.Points
987,Brazil,alta paulista (sao paulo),Mundo Novo,7.58,7.42,7.75,9.33,81.08
734,Brazil,brazil matas de minas,Catuai,7.58,7.50,7.67,10.00,82.25
421,Brazil,campos altos - cerrado,,7.58,7.67,7.67,10.00,83.25
637,Brazil,campos altos - cerrado,,7.42,7.42,7.25,10.00,82.58
681,Brazil,campos altos - cerrado,,7.42,7.50,7.25,10.00,82.42
...,...,...,...,...,...,...,...,...
502,Vietnam,vietnam cau dat,Caturra,7.75,7.67,7.42,10.00,83.00
564,Vietnam,vietnam tutra,Other,7.25,7.58,7.75,10.00,82.83
1338,Vietnam,,,6.75,6.67,6.92,6.67,73.75
823,Zambia,mubuyu estate,SL28,7.67,7.08,7.75,10.00,81.92
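
In the output above, the Vietnam row with no region appears after the rows with a known region, because `sort_values` places missing values last by default. As a sketch on three stand-in rows, the `na_position` argument changes this.

```python
import pandas as pd

coffee = pd.DataFrame(
    {
        "Country.of.Origin": ["Vietnam", "Vietnam", "Vietnam"],
        "Region": ["vietnam tutra", None, "vietnam cau dat"],
        "Total.Cup.Points": [82.83, 73.75, 83.00],
    }
)

nan_last = coffee.sort_values(by="Region")                        # default: missing values at the end
nan_first = coffee.sort_values(by="Region", na_position="first")  # missing values at the start

print(nan_last)
print(nan_first)
```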
