# **01. Estimates of location and variability**

In [29]:
import pandas as pd
from scipy import stats

## Data cleaning

In [9]:
survey = pd.read_csv('/content/drive/MyDrive/BEDU/COVID-19 Survey Student Responses.csv')
df = survey.copy()

In [10]:
df.isna().sum()

ID                                                                                     0
Region of residence                                                                    0
Age of Subject                                                                         0
Time spent on Online Class                                                             0
Rating of Online Class experience                                                     24
Medium for online class                                                               51
Time spent on self study                                                               0
Time spent on fitness                                                                  0
Time spent on sleep                                                                    0
Time spent on social media                                                             0
Prefered social media platform                                                         0
Time spent on TV     

In [11]:
df['Medium for online class'] = df['Medium for online class'].fillna('No medium')

In [12]:
df['Rating of Online Class experience'] = df['Rating of Online Class experience'].fillna('No rating')
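
The two `fillna` cells above can also be written as a single call that takes one label per column. The following is a minimal sketch on a made-up frame (`toy`, its columns, and its values are hypothetical and not taken from the survey file); it also checks the missing-value counts before and after filling.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the survey file)
toy = pd.DataFrame({
    "Rating": ["Good", None, "Poor"],
    "Medium": ["Tablet", "Smartphone", None],
    "Hours": [2.0, 3.5, 1.0],
})

# One replacement label per categorical column
fill_values = {"Rating": "No rating", "Medium": "No medium"}

missing_before = toy.isna().sum()            # missing values per column
toy_filled = toy.fillna(value=fill_values)   # fill each column with its own label
missing_after = toy_filled.isna().sum()

print(missing_before)
print(missing_after)
```

Filling with a text label keeps these columns categorical, which is consistent with leaving them out of the numeric estimates later on.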

## Identifying the dataset columns that hold numeric data

In [14]:
df.head()

| | ID | Region of residence | Age of Subject | Time spent on Online Class | Rating of Online Class experience | Medium for online class | Time spent on self study | Time spent on fitness | Time spent on sleep | Time spent on social media | Prefered social media platform | Time spent on TV | Number of meals per day | Change in your weight | Health issue during lockdown | Stress busters | Time utilized | Do you find yourself more connected with your family, close friends , relatives ? | What you miss the most |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R1 | Delhi-NCR | 21 | 2.0 | Good | Laptop/Desktop | 4.0 | 0.0 | 7.0 | 3.0 | Linkedin | 1 | 4 | Increased | NO | Cooking | YES | YES | School/college |
| 1 | R2 | Delhi-NCR | 21 | 0.0 | Excellent | Smartphone | 0.0 | 2.0 | 10.0 | 3.0 | Youtube | 0 | 3 | Decreased | NO | Scrolling through social media | YES | NO | Roaming around freely |
| 2 | R3 | Delhi-NCR | 20 | 7.0 | Very poor | Laptop/Desktop | 3.0 | 0.0 | 6.0 | 2.0 | Linkedin | 0 | 3 | Remain Constant | NO | Listening to music | NO | YES | Travelling |
| 3 | R4 | Delhi-NCR | 20 | 3.0 | Very poor | Smartphone | 2.0 | 1.0 | 6.0 | 5.0 | Instagram | 0 | 3 | Decreased | NO | Watching web series | NO | NO | Friends , relatives |
| 4 | R5 | Delhi-NCR | 21 | 3.0 | Good | Laptop/Desktop | 3.0 | 1.0 | 8.0 | 3.0 | Instagram | 1 | 4 | Remain Constant | NO | Social Media | NO | NO | Travelling |


The first column of interest for the analysis is `Age of Subject`, that is, the ages of the surveyed students. From that column onward, our numeric data are the columns whose dtype is not `object`.

In [20]:
df.dtypes

ID                                                                                     object
Region of residence                                                                    object
Age of Subject                                                                          int64
Time spent on Online Class                                                            float64
Rating of Online Class experience                                                      object
Medium for online class                                                                object
Time spent on self study                                                              float64
Time spent on fitness                                                                 float64
Time spent on sleep                                                                   float64
Time spent on social media                                                            float64
Prefered social media platform                              

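Now we view only the numeric columns of the original DataFrame. Because `.head()` is applied before the assignment, `df_num` keeps just the first five rows, so every estimate below is computed on those five rows.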

In [25]:
df_num = df.select_dtypes('number').head()
df_num

| | Age of Subject | Time spent on Online Class | Time spent on self study | Time spent on fitness | Time spent on sleep | Time spent on social media | Number of meals per day |
|---|---|---|---|---|---|---|---|
| 0 | 21 | 2.0 | 4.0 | 0.0 | 7.0 | 3.0 | 4 |
| 1 | 21 | 0.0 | 0.0 | 2.0 | 10.0 | 3.0 | 3 |
| 2 | 20 | 7.0 | 3.0 | 0.0 | 6.0 | 2.0 | 3 |
| 3 | 20 | 3.0 | 2.0 | 1.0 | 6.0 | 5.0 | 3 |
| 4 | 21 | 3.0 | 3.0 | 1.0 | 8.0 | 3.0 | 4 |
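
If the goal is to preview five rows while still computing estimates over every respondent, one option is to keep the numeric selection and the preview in separate variables. This is only a sketch on a toy frame; `toy`, `toy_num`, and `preview` are hypothetical names, and the values are invented for illustration.

```python
import pandas as pd

# Toy frame with one text column and two numeric columns (invented values)
toy = pd.DataFrame({
    "ID": [f"R{i}" for i in range(1, 9)],
    "Age": [21, 21, 20, 20, 21, 19, 22, 23],
    "Sleep": [7.0, 10.0, 6.0, 6.0, 8.0, 7.5, 9.0, 5.0],
})

toy_num = toy.select_dtypes("number")  # numeric columns only, all 8 rows
preview = toy_num.head()               # first 5 rows, meant for display

print(toy_num.shape)
print(preview.shape)
```

With this split, statistics such as `toy_num.mean()` describe the whole sample, while `preview` is used only to look at the data.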



## Identifying the relevance of the columns


---


*   The `Age of Subject` column matters for drawing conclusions about age, which is directly related to maturity, thinking, and behavior.
*   The columns `Time spent on Online Class` ... `Time spent on social media` let us study the respondents' study habits.
*   The `Number of meals per day` column lets us relate eating habits to productivity, health, and personal relationships.





## Computing estimates for the numeric columns

---

### *Mean.*



In [63]:
df_num.mean()

Age of Subject                20.6
Time spent on Online Class     3.0
Time spent on self study       2.4
Time spent on fitness          0.8
Time spent on sleep            7.4
Time spent on social media     3.2
Number of meals per day        3.4
dtype: float64

### *Median.*

In [62]:
df_num.median()

Age of Subject                21.0
Time spent on Online Class     3.0
Time spent on self study       3.0
Time spent on fitness          1.0
Time spent on sleep            7.0
Time spent on social media     3.0
Number of meals per day        3.0
dtype: float64

### *Trimmed mean.*

---

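Trimming a larger share of extreme values barely changes the mean here. With a 10% cut the result is identical to the plain mean, because `df_num` holds only five rows and `stats.trim_mean` drops `int(0.1 * 5) = 0` values from each end. With a 20% cut one value is dropped from each end of every column, and the differences remain small.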


In [64]:
trimmed_10 = stats.trim_mean(df_num, 0.1)  # one trimmed mean per column (axis=0)
for column, value in zip(df_num.columns, trimmed_10):
  print(column, value)

Age of Subject 20.6
Time spent on Online Class 3.0
Time spent on self study 2.4
Time spent on fitness 0.8
Time spent on sleep 7.4
Time spent on social media 3.2
Number of meals per day 3.4


In [65]:
trimmed_20 = stats.trim_mean(df_num, 0.2)  # one trimmed mean per column (axis=0)
for column, value in zip(df_num.columns, trimmed_20):
  print(column, round(value, 2))

Age of Subject 20.67
Time spent on Online Class 2.67
Time spent on self study 2.67
Time spent on fitness 0.67
Time spent on sleep 7.0
Time spent on social media 3.0
Number of meals per day 3.33
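
To see where these numbers come from, the sketch below reproduces the trimmed mean by hand for the five `Time spent on Online Class` values shown in the preview. The helper `manual_trim_mean` is a hypothetical name introduced only for illustration; it mirrors the rule that `int(proportion * n)` observations are dropped from each end of the sorted data.

```python
import numpy as np
from scipy import stats

# The five "Time spent on Online Class" values from the previewed rows
x = np.array([2.0, 0.0, 7.0, 3.0, 3.0])

def manual_trim_mean(values, proportion):
    """Drop int(proportion * n) values from each end of the sorted data, then average."""
    ordered = np.sort(values)
    k = int(proportion * len(ordered))  # observations removed per tail
    return ordered[k:len(ordered) - k].mean()

for p in (0.1, 0.2):
    # Hand-rolled result next to SciPy's result for the same proportion
    print(p, manual_trim_mean(x, p), stats.trim_mean(x, p))
```

With five observations, a proportion of 0.1 removes nothing (`int(0.5)` is 0), while 0.2 removes the smallest and the largest value.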


### *Standard deviation.*

In [61]:
df_num.std()

Age of Subject                0.547723
Time spent on Online Class    2.549510
Time spent on self study      1.516575
Time spent on fitness         0.836660
Time spent on sleep           1.673320
Time spent on social media    1.095445
Number of meals per day       0.547723
dtype: float64
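
These values are sample standard deviations: pandas divides by `n - 1` by default (`ddof=1`), whereas NumPy's `np.std` divides by `n` unless told otherwise. A short sketch using the five `Age of Subject` values from the preview illustrates the difference; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# "Age of Subject" in the five previewed rows
ages = pd.Series([21, 21, 20, 20, 21])

sample_std = ages.std()                  # pandas default: ddof=1 (divide by n - 1)
population_std = ages.std(ddof=0)        # divide by n
numpy_default = np.std(ages.to_numpy())  # NumPy default: ddof=0

print(round(sample_std, 6))
print(round(population_std, 6))
print(round(numpy_default, 6))
```

With only five rows the choice of divisor matters noticeably, so it helps to state which convention a reported standard deviation uses.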

### *Range.*

In [71]:
df_num.max(axis=0)

Age of Subject                21.0
Time spent on Online Class     7.0
Time spent on self study       4.0
Time spent on fitness          2.0
Time spent on sleep           10.0
Time spent on social media     5.0
Number of meals per day        4.0
dtype: float64

In [69]:
df_num.min(axis=0)

Age of Subject                20.0
Time spent on Online Class     0.0
Time spent on self study       0.0
Time spent on fitness          0.0
Time spent on sleep            6.0
Time spent on social media     2.0
Number of meals per day        3.0
dtype: float64

In [70]:
df_num.max(axis=0)-df_num.min(axis=0)

Age of Subject                1.0
Time spent on Online Class    7.0
Time spent on self study      4.0
Time spent on fitness         2.0
Time spent on sleep           4.0
Time spent on social media    3.0
Number of meals per day       1.0
dtype: float64

### *Percentiles.*

In [94]:
def per(df, value):
  # Quantile `value` of every column of the frame passed in
  return [df[col].quantile(value) for col in df]

datos = {
    'Percentil 10': per(df_num, 0.1),
    'Percentil 25': per(df_num, 0.25),
    'Percentil 50': per(df_num, 0.5),
    'Percentil 75': per(df_num, 0.75),
    'Percentil 90': per(df_num, 0.9)
}

df_per = pd.DataFrame(datos)
df_per.index = df_num.columns

df_per

| | Percentil 10 | Percentil 25 | Percentil 50 | Percentil 75 | Percentil 90 |
|---|---|---|---|---|---|
| Age of Subject | 20.0 | 20.0 | 21.0 | 21.0 | 21.0 |
| Time spent on Online Class | 0.8 | 2.0 | 3.0 | 3.0 | 5.4 |
| Time spent on self study | 0.8 | 2.0 | 3.0 | 3.0 | 3.6 |
| Time spent on fitness | 0.0 | 0.0 | 1.0 | 1.0 | 1.6 |
| Time spent on sleep | 6.0 | 6.0 | 7.0 | 8.0 | 9.2 |
| Time spent on social media | 2.4 | 3.0 | 3.0 | 3.0 | 4.2 |
| Number of meals per day | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 |
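
A similar table can be built without a helper function, since `DataFrame.quantile` accepts a list of levels. The sketch below is one possible alternative, shown on a two-column toy frame that reuses the previewed values of `Time spent on Online Class` and `Time spent on sleep`; `toy_num`, `levels`, and `table` are hypothetical names.

```python
import pandas as pd

# Two columns taken from the five previewed rows
toy_num = pd.DataFrame({
    "Time spent on Online Class": [2.0, 0.0, 7.0, 3.0, 3.0],
    "Time spent on sleep": [7.0, 10.0, 6.0, 6.0, 8.0],
})

levels = [0.1, 0.25, 0.5, 0.75, 0.9]

# quantile() with a list returns one row per level; transpose to get one row per column
table = toy_num.quantile(levels).T
table.columns = [f"Percentil {round(q * 100)}" for q in levels]

print(table)
```

pandas interpolates linearly between neighbouring observations by default, which is why a percentile such as 0.8 or 9.2 can appear even though no respondent reported that value.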


### *Interquartile range.*

In [75]:
df_num.quantile(0.75) - df_num.quantile(0.25)

Age of Subject                1.0
Time spent on Online Class    1.0
Time spent on self study      1.0
Time spent on fitness         1.0
Time spent on sleep           2.0
Time spent on social media    0.0
Number of meals per day       1.0
dtype: float64
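
SciPy, already imported at the top of the notebook, also provides `stats.iqr`, which can serve as a cross-check. The sketch below compares both approaches on the five previewed `Time spent on sleep` values, assuming the default linear interpolation in both libraries.

```python
import pandas as pd
from scipy import stats

# "Time spent on sleep" in the five previewed rows
sleep = pd.Series([7.0, 10.0, 6.0, 6.0, 8.0])

iqr_pandas = sleep.quantile(0.75) - sleep.quantile(0.25)  # Q3 - Q1 with pandas
iqr_scipy = stats.iqr(sleep)                              # same quantity with SciPy

print(iqr_pandas, iqr_scipy)
```

Both calls give the distance between the 25th and 75th percentiles, so they should agree as long as the same interpolation rule is used.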