# TA session: I1, Introduction to Data Science


Data: `Caffeine Collective`, a CSV file with information about orders placed at coffee shops.
Link to the official Kaggle page: [kaggle db](https://www.kaggle.com/datasets/ayeshaimran123/caffeine-collective)

Notebook: Francisca Yepsen
If you have questions, you can email me at fyepsen@uc.cl

### 1. Loading the data
There are three options: work in Jupyter Notebook with the files downloaded locally, mount your own Google Drive account and work in the Google Colab environment, or upload the files to Colab temporarily.
#### 1.1 Libraries
As good practice, remember to import at the top the libraries you will use throughout the notebook.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
dt_cafe = pd.read_csv('Coffe_sales.csv')  # pd.read_csv loads the CSV as a pandas DataFrame

If the file were in Parquet format, it could be loaded like this:

```
import pyarrow.parquet as pq

cafes = pq.read_table('cafes.parquet')
cafes = cafes.to_pandas()
```



In [None]:
dt_cafe.head(20)  # Returns the first rows of the DataFrame; here the first 20 are requested

Unnamed: 0,hour_of_day,cash_type,money,coffee_name,Time_of_Day,Weekday,Month_name,Weekdaysort,Monthsort,Date,Time
0,10,card,38.7,Latte,Morning,Fri,Mar,5,3,2024-03-01,10:15:50.520000
1,12,card,38.7,Hot Chocolate,Afternoon,Fri,Mar,5,3,2024-03-01,12:19:22.539000
2,12,card,38.7,Hot Chocolate,Afternoon,Fri,Mar,5,3,2024-03-01,12:20:18.089000
3,13,card,28.9,Americano,Afternoon,Fri,Mar,5,3,2024-03-01,13:46:33.006000
4,13,card,38.7,Latte,Afternoon,Fri,Mar,5,3,2024-03-01,13:48:14.626000
5,15,card,33.8,Americano with Milk,Afternoon,Fri,Mar,5,3,2024-03-01,15:39:47.726000
6,16,card,38.7,Hot Chocolate,Afternoon,Fri,Mar,5,3,2024-03-01,16:19:02.756000
7,18,card,33.8,Americano with Milk,Night,Fri,Mar,5,3,2024-03-01,18:39:03.580000
8,19,card,38.7,Cocoa,Night,Fri,Mar,5,3,2024-03-01,19:22:01.762000
9,19,card,33.8,Americano with Milk,Night,Fri,Mar,5,3,2024-03-01,19:23:15.887000


In [None]:
dt_cafe.tail(5)  # Returns the last rows of the DataFrame; here only the last 5 are requested

Unnamed: 0,hour_of_day,cash_type,money,coffee_name,Time_of_Day,Weekday,Month_name,Weekdaysort,Monthsort,Date,Time
3542,10,card,35.76,Cappuccino,Morning,Sun,Mar,7,3,2025-03-23,10:34:54.894000
3543,14,card,35.76,Cocoa,Afternoon,Sun,Mar,7,3,2025-03-23,14:43:37.362000
3544,14,card,35.76,Cocoa,Afternoon,Sun,Mar,7,3,2025-03-23,14:44:16.864000
3545,15,card,25.96,Americano,Afternoon,Sun,Mar,7,3,2025-03-23,15:47:28.723000
3546,18,card,35.76,Latte,Night,Sun,Mar,7,3,2025-03-23,18:11:38.635000


### 2. Exploring the data
The goal is to examine the categories, dimensions, format, units, and types of the data, since they are handled differently depending on the objective of our work.

In [None]:
dt_cafe.info()  # Shows the columns, their data types, and the dimensions of the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3547 entries, 0 to 3546
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   hour_of_day  3547 non-null   int64  
 1   cash_type    3547 non-null   object 
 2   money        3547 non-null   float64
 3   coffee_name  3547 non-null   object 
 4   Time_of_Day  3547 non-null   object 
 5   Weekday      3547 non-null   object 
 6   Month_name   3547 non-null   object 
 7   Weekdaysort  3547 non-null   int64  
 8   Monthsort    3547 non-null   int64  
 9   Date         3547 non-null   object 
 10  Time         3547 non-null   object 
dtypes: float64(1), int64(3), object(7)
memory usage: 304.9+ KB


From the previous cell we can tell that the DataFrame has 3547 rows and 11 columns.

In other words, 3547 coffee orders were recorded.

Hypothetical cases:
1. Replace the decimal separator ("," -> ".")


```
dt_cafe["money"]= dt_cafe["money"].str.replace(',','.')
```

2. Change the data type of a column
```
dt_cafe["money"]= dt_cafe["money"].astype("float64")
```

3. Drop a column

```
dt_cafe_nuevo = dt_cafe.drop(columns=["Weekdaysort"])
```
4. Drop duplicates:
The `keep` parameter indicates which occurrence to keep (`first` or `last`)
```
dt_cafe = dt_cafe.drop_duplicates(keep = 'first')
```
5. Rename columns
```
dt_cafe = dt_cafe.rename(columns={"money": "monto_pagado","hour_of_day" : "hora"})
```
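The five hypothetical cases above can be chained on a small toy table. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, and they assume that `money` arrives as text with a decimal comma, which is not the case in the real file.

```python
import pandas as pd

# Toy data (hypothetical): "money" is text with a decimal comma,
# and the second row is an exact duplicate of the first.
toy = pd.DataFrame({
    "hour_of_day": [10, 10, 12],
    "money": ["38,7", "38,7", "28,9"],
    "Weekdaysort": [5, 5, 5],
})

# Cases 1 and 2: swap the decimal separator, then cast to float.
toy["money"] = toy["money"].str.replace(",", ".", regex=False).astype("float64")

# Case 3: drop a column (drop returns a new DataFrame).
toy = toy.drop(columns=["Weekdaysort"])

# Case 4: drop duplicated rows, keeping the first occurrence.
toy = toy.drop_duplicates(keep="first")

# Case 5: rename columns.
toy = toy.rename(columns={"money": "monto_pagado", "hour_of_day": "hora"})

print(toy)
```

Note that `.str.replace` only works on text columns, which is why the cast to `float64` comes after the separator swap.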

In [None]:
# Convert the column to the datetime type
dt_cafe['Date'] = dt_cafe['Date'].astype('datetime64[ns]')

For other conversions or ways of using the datetime type, here is a link with usage examples: [time formats](https://www.geeksforgeeks.org/python/how-to-convert-datetime-to-date-in-pandas/)
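As a complement, here is a minimal sketch of another common route: `pd.to_datetime` for the conversion and the `.dt` accessor to extract parts of the date. The `fechas` table is a toy example (three dates taken from the dataset), not part of the original notebook.

```python
import pandas as pd

# Toy table with dates stored as text, as they arrive from the CSV.
fechas = pd.DataFrame({"Date": ["2024-03-01", "2024-06-20", "2025-03-23"]})

# Parse the text into datetime values, stating the expected format explicitly.
fechas["Date"] = pd.to_datetime(fechas["Date"], format="%Y-%m-%d")

# Once the column is datetime, the .dt accessor exposes its components.
fechas["year"] = fechas["Date"].dt.year
fechas["month"] = fechas["Date"].dt.month
fechas["weekday"] = fechas["Date"].dt.day_name()

print(fechas)
```

Giving `format` explicitly makes the conversion fail loudly if a date does not match, instead of silently guessing.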



In [None]:
# Count the NaN values in each column
dt_cafe.isna().sum()

Unnamed: 0,0
hour_of_day,0
cash_type,0
money,0
coffee_name,0
Time_of_Day,0
Weekday,0
Month_name,0
Weekdaysort,0
Monthsort,0
Date,0


In [None]:
# The "Weekdaysort" column will not be used, so we drop it
# The result without that column is stored in a new DataFrame
cafe = dt_cafe.drop(columns = ['Weekdaysort'])
cafe.columns

Index(['hour_of_day', 'cash_type', 'money', 'coffee_name', 'Time_of_Day',
       'Weekday', 'Month_name', 'Monthsort', 'Date', 'Time'],
      dtype='object')

In [None]:
# Check for duplicates; this can also be done on subsets, using groupby to look at specific attributes
# Flags, row by row, whether the row repeats an earlier one.
cafe.duplicated()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
3542,False
3543,False
3544,False
3545,False
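Reading a long column of `False` values by eye is impractical, so it often helps to count instead. The sketch below uses a made-up `demo` table to show three variants: counting fully duplicated rows, restricting the check to a subset of columns, and counting repetitions with `groupby`.

```python
import pandas as pd

# Toy orders (hypothetical values): row 1 is an exact copy of row 0.
demo = pd.DataFrame({
    "coffee_name": ["Latte", "Latte", "Cocoa", "Latte"],
    "money": [38.7, 38.7, 38.7, 32.82],
    "Date": ["2024-03-01", "2024-03-01", "2024-03-01", "2024-03-02"],
})

# Number of rows that repeat an earlier row in every column.
n_full = int(demo.duplicated().sum())

# Number of rows that repeat an earlier row when only "coffee_name" is compared.
n_by_name = int(demo.duplicated(subset=["coffee_name"]).sum())

# How many rows each coffee has, as an alternative view with groupby.
per_coffee = demo.groupby("coffee_name").size()

print(n_full, n_by_name)
print(per_coffee)
```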


In [None]:
# Another option is to build a DataFrame with the duplicated rows, in case you want to keep them
filas_duplicadas = cafe[cafe.duplicated()]
filas_duplicadas

Unnamed: 0,hour_of_day,cash_type,money,coffee_name,Time_of_Day,Weekday,Month_name,Monthsort,Date,Time


In [None]:
# describe() returns summary statistics (count, mean, quartiles, std) for each numeric column; here the datetime column Date is summarized too
cafe.describe()

Unnamed: 0,hour_of_day,money,Monthsort,Date
count,3547.0,3547.0,3547.0,3547
mean,14.185791,31.645216,6.453905,2024-10-04 17:34:43.676346368
min,6.0,18.12,1.0,2024-03-01 00:00:00
25%,10.0,27.92,3.0,2024-07-17 12:00:00
50%,14.0,32.82,7.0,2024-10-10 00:00:00
75%,18.0,35.76,10.0,2025-01-11 00:00:00
max,22.0,38.7,12.0,2025-03-23 00:00:00
std,4.23401,4.877754,3.500754,


In [None]:
# To access the order(s) with the largest amount of money spent on record
cafe[cafe['money'] == cafe['money'].max()]

Unnamed: 0,hour_of_day,cash_type,money,coffee_name,Time_of_Day,Weekday,Month_name,Monthsort,Date,Time
0,10,card,38.7,Latte,Morning,Fri,Mar,3,2024-03-01,10:15:50.520000
1,12,card,38.7,Hot Chocolate,Afternoon,Fri,Mar,3,2024-03-01,12:19:22.539000
2,12,card,38.7,Hot Chocolate,Afternoon,Fri,Mar,3,2024-03-01,12:20:18.089000
4,13,card,38.7,Latte,Afternoon,Fri,Mar,3,2024-03-01,13:48:14.626000
6,16,card,38.7,Hot Chocolate,Afternoon,Fri,Mar,3,2024-03-01,16:19:02.756000
...,...,...,...,...,...,...,...,...,...,...
274,13,card,38.7,Hot Chocolate,Afternoon,Fri,Apr,4,2024-04-19,13:58:54.064000
275,13,card,38.7,Cappuccino,Afternoon,Fri,Apr,4,2024-04-19,13:59:49.690000
276,18,card,38.7,Cocoa,Night,Fri,Apr,4,2024-04-19,18:23:19.311000
282,13,card,38.7,Hot Chocolate,Afternoon,Sat,Apr,4,2024-04-20,13:10:55.151000
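As the output shows, comparing against `max()` returns every order tied at the maximum. If a single row is enough, `idxmax` is one alternative. The following sketch contrasts both on a made-up `demo` table.

```python
import pandas as pd

# Toy orders (hypothetical values) with two rows tied at the maximum.
demo = pd.DataFrame({
    "coffee_name": ["Latte", "Americano", "Cocoa"],
    "money": [38.7, 28.9, 38.7],
})

# Boolean mask: keeps every row whose amount equals the maximum.
all_max = demo[demo["money"] == demo["money"].max()]

# idxmax: returns the index label of the first row that reaches the maximum.
first_max = demo.loc[demo["money"].idxmax()]

print(all_max)
print(first_max)
```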


In [None]:
# To see the time of day when the most coffees are ordered, according to "Time_of_Day"
cafe.groupby('Time_of_Day').count()

Unnamed: 0_level_0,hour_of_day,cash_type,money,coffee_name,Weekday,Month_name,Monthsort,Date,Time
Time_of_Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Afternoon,1205,1205,1205,1205,1205,1205,1205,1205,1205
Morning,1181,1181,1181,1181,1181,1181,1181,1181,1181
Night,1161,1161,1161,1161,1161,1161,1161,1161,1161
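`groupby(...).count()` repeats the same count in every column. When only the number of orders per group matters, `value_counts` or `groupby(...).size()` give a single column. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy orders (hypothetical values).
demo = pd.DataFrame({
    "Time_of_Day": ["Morning", "Afternoon", "Afternoon", "Night", "Afternoon", "Morning"],
})

# One count per category, sorted from most to least frequent.
conteo = demo["Time_of_Day"].value_counts()

# Equivalent counts with groupby, sorted by category name instead.
por_grupo = demo.groupby("Time_of_Day").size()

# Category with the most orders.
mas_pedidos = conteo.idxmax()

print(conteo)
print(mas_pedidos)
```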


In [None]:
# Show only the orders where more than 20 dollars was spent
cafe[cafe['money'] > 20]

Unnamed: 0,hour_of_day,cash_type,money,coffee_name,Time_of_Day,Weekday,Month_name,Monthsort,Date,Time
0,10,card,38.70,Latte,Morning,Fri,Mar,3,2024-03-01,10:15:50.520000
1,12,card,38.70,Hot Chocolate,Afternoon,Fri,Mar,3,2024-03-01,12:19:22.539000
2,12,card,38.70,Hot Chocolate,Afternoon,Fri,Mar,3,2024-03-01,12:20:18.089000
3,13,card,28.90,Americano,Afternoon,Fri,Mar,3,2024-03-01,13:46:33.006000
4,13,card,38.70,Latte,Afternoon,Fri,Mar,3,2024-03-01,13:48:14.626000
...,...,...,...,...,...,...,...,...,...,...
3542,10,card,35.76,Cappuccino,Morning,Sun,Mar,3,2025-03-23,10:34:54.894000
3543,14,card,35.76,Cocoa,Afternoon,Sun,Mar,3,2025-03-23,14:43:37.362000
3544,14,card,35.76,Cocoa,Afternoon,Sun,Mar,3,2025-03-23,14:44:16.864000
3545,15,card,25.96,Americano,Afternoon,Sun,Mar,3,2025-03-23,15:47:28.723000


In [None]:
# Build a DataFrame that contains only the winter sales

cota_inicio= '2024-06-20'
cota_final = '2024-09-21'

cafe_invierno = cafe[(cafe['Date'] >= cota_inicio) & (cafe['Date'] <= cota_final)]
cafe_invierno

Unnamed: 0,hour_of_day,cash_type,money,coffee_name,Time_of_Day,Weekday,Month_name,Monthsort,Date,Time
742,10,card,37.72,Latte,Morning,Thu,Jun,6,2024-06-20,10:50:06.453000
743,18,card,37.72,Latte,Night,Thu,Jun,6,2024-06-20,18:59:02.082000
744,19,card,37.72,Latte,Night,Thu,Jun,6,2024-06-20,19:00:00.237000
745,21,card,37.72,Latte,Night,Thu,Jun,6,2024-06-20,21:39:10.013000
746,21,card,37.72,Latte,Night,Thu,Jun,6,2024-06-20,21:57:40.554000
...,...,...,...,...,...,...,...,...,...,...
1535,20,card,32.82,Latte,Night,Sat,Sep,9,2024-09-21,20:33:58.175000
1536,20,card,32.82,Latte,Night,Sat,Sep,9,2024-09-21,20:35:30.169000
1537,22,card,27.92,Americano with Milk,Night,Sat,Sep,9,2024-09-21,22:18:46.088000
1538,22,card,23.02,Americano,Night,Sat,Sep,9,2024-09-21,22:19:50.128000
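The same range filter can be written with `Series.between`, which includes both bounds by default. This is a sketch on a made-up `demo` table, using the same winter bounds as the cell above.

```python
import pandas as pd

# Toy sales (hypothetical values) around the winter bounds used above.
demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-06-19", "2024-06-20", "2024-08-01", "2024-09-21", "2024-09-22"]
    ),
    "money": [37.72, 37.72, 27.92, 32.82, 23.02],
})

cota_inicio = pd.Timestamp("2024-06-20")
cota_final = pd.Timestamp("2024-09-21")

# between keeps rows with cota_inicio <= Date <= cota_final (both ends included).
invierno = demo[demo["Date"].between(cota_inicio, cota_final)]

print(invierno)
```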


In [None]:
cafe_americano = cafe[(cafe['coffee_name'] == 'Americano with Milk') & (cafe['money'] > 25.05) & (cafe['Date'] > '2024-04-18')]
cafe_americano

Unnamed: 0,hour_of_day,cash_type,money,coffee_name,Time_of_Day,Weekday,Month_name,Monthsort,Date,Time
272,13,card,33.80,Americano with Milk,Afternoon,Fri,Apr,4,2024-04-19,13:11:47.848000
277,18,card,33.80,Americano with Milk,Night,Fri,Apr,4,2024-04-19,18:25:33.440000
278,12,card,33.80,Americano with Milk,Afternoon,Sat,Apr,4,2024-04-20,12:08:35.701000
279,12,card,33.80,Americano with Milk,Afternoon,Sat,Apr,4,2024-04-20,12:09:42.664000
281,13,card,33.80,Americano with Milk,Afternoon,Sat,Apr,4,2024-04-20,13:09:47.905000
...,...,...,...,...,...,...,...,...,...,...
3530,10,card,30.86,Americano with Milk,Morning,Sat,Mar,3,2025-03-22,10:30:09.403000
3533,12,card,30.86,Americano with Milk,Afternoon,Sat,Mar,3,2025-03-22,12:18:27.491000
3536,13,card,30.86,Americano with Milk,Afternoon,Sat,Mar,3,2025-03-22,13:23:17.918000
3539,17,card,30.86,Americano with Milk,Night,Sat,Mar,3,2025-03-22,17:53:35.942000


In [None]:
# Sort the DataFrame by the given criteria
cafe_americano.sort_values(by=["hour_of_day", "money"], ascending=[True, True]).head()

Unnamed: 0,hour_of_day,cash_type,money,coffee_name,Time_of_Day,Weekday,Month_name,Monthsort,Date,Time
3211,6,card,30.86,Americano with Milk,Morning,Fri,Feb,2,2025-02-28,06:52:45.591000
3212,6,card,30.86,Americano with Milk,Morning,Fri,Feb,2,2025-02-28,06:54:59.973000
998,7,card,27.92,Americano with Milk,Morning,Tue,Jul,7,2024-07-30,07:41:10.945000
1022,7,card,27.92,Americano with Milk,Morning,Wed,Jul,7,2024-07-31,07:59:52.098000
1044,7,card,27.92,Americano with Milk,Morning,Thu,Aug,8,2024-08-01,07:31:00.085000
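The `ascending` list takes one value per column in `by`, so each column can be sorted in a different direction. A small sketch on made-up data, sorting by hour from earliest to latest and, within each hour, by amount from highest to lowest:

```python
import pandas as pd

# Toy orders (hypothetical values).
demo = pd.DataFrame({
    "hour_of_day": [7, 6, 7, 6],
    "money": [27.92, 30.86, 33.8, 25.96],
})

# hour_of_day ascending, money descending within each hour.
ordenado = demo.sort_values(
    by=["hour_of_day", "money"], ascending=[True, False]
).reset_index(drop=True)

print(ordenado)
```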


Helpful references:
1. [Data exploration notebook](https://www.kaggle.com/code/kashnitsky/topic-1-exploratory-data-analysis-with-pandas)
2. [pandas shortcuts](https://medium.com/data-science-collective/5-pandas-basics-every-junior-data-scientist-should-master-first-2bd0cbcdb79a)