# Foundations of Data Science - CMU Portugal Academy

> 
> Instructors:
>   - David Semedo (df.semedo@fct.unl.pt)
>   - Rafael Ferreira (rah.ferreira@fct.unl.pt)
>

In [49]:
import numpy as np
import pandas as pd


## Reference dataset - Mental Illness ([Link](https://www.kaggle.com/datasets/imtkaggleteam/mental-health))

We will use the "Mental Health" dataset as a reference to introduce a set of Pandas operations.

**Motivation**:

* Mental health is an essential part of people’s lives and society. Poor mental health affects our well-being, our ability to work, and our relationships with friends, family, and community.

* Mental health conditions are not uncommon. Hundreds of millions suffer from them yearly, and many more do over their lifetimes. It’s estimated that 1 in 3 women and 1 in 5 men will experience major depression in their lives. Other conditions, such as schizophrenia and bipolar disorder, are less common but still have a large impact on people’s lives.


In [50]:
dataset_path = "datasets/1- mental-illnesses-prevalence.csv"
df = pd.read_csv(dataset_path)

In [51]:
df

Unnamed: 0,Entity,Code,Year,Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized,Depressive disorders (share of population) - Sex: Both - Age: Age-standardized,Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized,Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized,Eating disorders (share of population) - Sex: Both - Age: Age-standardized
0,Afghanistan,AFG,1990,0.223206,4.996118,4.713314,0.703023,0.127700
1,Afghanistan,AFG,1991,0.222454,4.989290,4.702100,0.702069,0.123256
2,Afghanistan,AFG,1992,0.221751,4.981346,4.683743,0.700792,0.118844
3,Afghanistan,AFG,1993,0.220987,4.976958,4.673549,0.700087,0.115089
4,Afghanistan,AFG,1994,0.220183,4.977782,4.670810,0.699898,0.111815
...,...,...,...,...,...,...,...,...
6415,Zimbabwe,ZWE,2015,0.201042,3.407624,3.184012,0.538596,0.095652
6416,Zimbabwe,ZWE,2016,0.201319,3.410755,3.187148,0.538593,0.096662
6417,Zimbabwe,ZWE,2017,0.201639,3.411965,3.188418,0.538589,0.097330
6418,Zimbabwe,ZWE,2018,0.201976,3.406929,3.172111,0.538585,0.097909


### Obtain an overview and basic information about the DataFrame

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6420 entries, 0 to 6419
Data columns (total 8 columns):
 #   Column                                                                             Non-Null Count  Dtype  
---  ------                                                                             --------------  -----  
 0   Entity                                                                             6420 non-null   object 
 1   Code                                                                               6150 non-null   object 
 2   Year                                                                               6420 non-null   int64  
 3   Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized  6420 non-null   float64
 4   Depressive disorders (share of population) - Sex: Both - Age: Age-standardized     6420 non-null   float64
 5   Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized        6420 non-null   float6

In [53]:
df.head()  # Inspect the first 5 rows. A different size can be passed, e.g. df.head(20)

Unnamed: 0,Entity,Code,Year,Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized,Depressive disorders (share of population) - Sex: Both - Age: Age-standardized,Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized,Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized,Eating disorders (share of population) - Sex: Both - Age: Age-standardized
0,Afghanistan,AFG,1990,0.223206,4.996118,4.713314,0.703023,0.1277
1,Afghanistan,AFG,1991,0.222454,4.98929,4.7021,0.702069,0.123256
2,Afghanistan,AFG,1992,0.221751,4.981346,4.683743,0.700792,0.118844
3,Afghanistan,AFG,1993,0.220987,4.976958,4.673549,0.700087,0.115089
4,Afghanistan,AFG,1994,0.220183,4.977782,4.67081,0.699898,0.111815


In [54]:
df.columns

Index(['Entity', 'Code', 'Year',
       'Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized',
       'Depressive disorders (share of population) - Sex: Both - Age: Age-standardized',
       'Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized',
       'Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized',
       'Eating disorders (share of population) - Sex: Both - Age: Age-standardized'],
      dtype='object')

#### Single column Inspection 

In [55]:
df["Year"]

0       1990
1       1991
2       1992
3       1993
4       1994
        ... 
6415    2015
6416    2016
6417    2017
6418    2018
6419    2019
Name: Year, Length: 6420, dtype: int64

The result of the operation above is a Pandas Series:

In [56]:
type(df["Year"])

pandas.core.series.Series
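
As a minimal sketch of the distinction, the toy DataFrame below (made-up values, not taken from the dataset) shows that indexing with a single column name gives a Series, while indexing with a list of column names gives a DataFrame:

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({"Entity": ["A", "B"], "Year": [1990, 1991]})

as_series = toy["Year"]    # single column name -> Series
as_frame = toy[["Year"]]   # list of column names -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

The double-bracket form is useful when a later step expects a two-dimensional object, even for a single column.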

### Removing columns

We can remove columns with the `DataFrame.drop()` function ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)). There are multiple ways:
* By specifying the axis index (0 for rows, 1 for columns): `df.drop(<column name string>, axis=1)`
* By using the keyword argument `columns`: `df.drop(columns=<list of columns>)`


Keep in mind that, by default, the drop operation is not an "inplace" operation: it returns a new DataFrame without the dropped columns and leaves the original DataFrame unchanged.

In [57]:
print(f"Num columns: {len(df.columns)}")
df.drop(columns=["Code"])  # Equivalent to df.drop("Code", axis=1)
print(f"Num columns after drop: {len(df.columns)}")


Num columns: 8
Num columns after drop: 8


In [58]:
# Set the inplace argument to True, or assign the returned DataFrame to the original variable:
df.drop(columns=["Code"], inplace=True)
# or
#df = df.drop(columns=["Code"])
print(f"Num columns after drop: {len(df.columns)}")


Num columns after drop: 7
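
The two calling styles can be compared on a small toy DataFrame (made-up values). This is only a sketch: it also uses the `errors="ignore"` argument of `drop`, which avoids a `KeyError` when a column is already gone, for instance when the cell above is run twice.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({"Entity": ["A", "B"], "Code": ["AAA", "BBB"], "Year": [1990, 1991]})

by_axis = toy.drop("Code", axis=1)          # axis-based form
by_keyword = toy.drop(columns=["Code"])     # keyword-based form

print(by_axis.equals(by_keyword))           # both forms give the same result
print(list(toy.columns))                    # the original is left unchanged

# Dropping a column that no longer exists raises KeyError unless errors="ignore"
dropped_again = by_keyword.drop(columns=["Code"], errors="ignore")
print(list(dropped_again.columns))
```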


### Quantitative Variables: Obtain descriptive statistics from a single column.

In [59]:
df["Year"].describe()

count    6420.000000
mean     2004.500000
std         8.656116
min      1990.000000
25%      1997.000000
50%      2004.500000
75%      2012.000000
max      2019.000000
Name: Year, dtype: float64

Find the maximum and minimum values of a given column:

In [60]:
df["Year"].max(), df["Year"].min()

(2019, 1990)

### Categorical Values

For categorical values, we might want to know the domain size (i.e. how many unique values).

In [61]:
unique_entities = df["Entity"].unique() # Produces a NumPy array
unique_entities

array(['Afghanistan', 'Africa (IHME GBD)', 'Albania', 'Algeria',
       'America (IHME GBD)', 'American Samoa', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Asia (IHME GBD)',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Comoros', 'Congo', 'Cook Islands', 'Costa Rica', "Cote d'Ivoire",
       'Croatia', 'Cuba', 'Cyprus', 'Czechia',
       'Democratic Republic of Congo', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethiopia', 'Europe (IHME GBD)', 'Europe

In [62]:
len(unique_entities)

214
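
A possible shortcut is `Series.nunique()`, which returns the count directly. The sketch below uses a made-up Series and highlights one difference worth keeping in mind: `unique()` includes missing values in its result, while `nunique()` ignores them unless `dropna=False` is passed.

```python
import pandas as pd

# Toy data, made up for illustration; None stands for a missing value
entities = pd.Series(["Afghanistan", "Albania", "Afghanistan", None], name="Entity")

n_from_unique = len(entities.unique())              # the missing value is counted
n_unique = entities.nunique()                       # missing values are ignored by default
n_unique_with_nan = entities.nunique(dropna=False)  # missing value counted again

print(n_from_unique, n_unique, n_unique_with_nan)
```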

Another interesting operation is to obtain the distribution of the different values:

In [63]:
df["Entity"].value_counts().iloc[:10]

Entity
Afghanistan                 30
Netherlands                 30
Nicaragua                   30
Niger                       30
Nigeria                     30
Niue                        30
North Korea                 30
North Macedonia             30
Northern Mariana Islands    30
Norway                      30
Name: count, dtype: int64

We can slice the resulting Series with `.iloc`, using regular Python slice syntax:

In [64]:
df.columns

Index(['Entity', 'Year',
       'Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized',
       'Depressive disorders (share of population) - Sex: Both - Age: Age-standardized',
       'Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized',
       'Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized',
       'Eating disorders (share of population) - Sex: Both - Age: Age-standardized'],
      dtype='object')

In [65]:
# Inspect the first 15 values
df["Entity"].value_counts().iloc[:15]

Entity
Afghanistan                 30
Netherlands                 30
Nicaragua                   30
Niger                       30
Nigeria                     30
Niue                        30
North Korea                 30
North Macedonia             30
Northern Mariana Islands    30
Norway                      30
Oman                        30
Pakistan                    30
Palau                       30
Palestine                   30
Panama                      30
Name: count, dtype: int64
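
When relative frequencies are more informative than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
entities = pd.Series(["A", "A", "B", "C"], name="Entity")

counts = entities.value_counts()                # absolute counts, sorted descending
shares = entities.value_counts(normalize=True)  # fractions that sum to 1

print(counts.iloc[:2])
print(shares)
```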

We can test a condition over each row:

In [66]:
anxiety_column = "Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized"

In [67]:
df[anxiety_column] 

0       4.713314
1       4.702100
2       4.683743
3       4.673549
4       4.670810
          ...   
6415    3.184012
6416    3.187148
6417    3.188418
6418    3.172111
6419    3.137017
Name: Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized, Length: 6420, dtype: float64

In [68]:
df[anxiety_column] > 4

0        True
1        True
2        True
3        True
4        True
        ...  
6415    False
6416    False
6417    False
6418    False
6419    False
Name: Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized, Length: 6420, dtype: bool

This condition can be used to index the DataFrame, and obtain the rows for which the condition is True:

In [69]:
df[df[anxiety_column] > 4]

Unnamed: 0,Entity,Year,Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized,Depressive disorders (share of population) - Sex: Both - Age: Age-standardized,Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized,Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized,Eating disorders (share of population) - Sex: Both - Age: Age-standardized
0,Afghanistan,1990,0.223206,4.996118,4.713314,0.703023,0.127700
1,Afghanistan,1991,0.222454,4.989290,4.702100,0.702069,0.123256
2,Afghanistan,1992,0.221751,4.981346,4.683743,0.700792,0.118844
3,Afghanistan,1993,0.220987,4.976958,4.673549,0.700087,0.115089
4,Afghanistan,1994,0.220183,4.977782,4.670810,0.699898,0.111815
...,...,...,...,...,...,...,...
6355,Yemen,2015,0.229845,4.892351,4.778325,0.725891,0.140348
6356,Yemen,2016,0.228970,4.884568,4.767840,0.725918,0.136837
6357,Yemen,2017,0.227927,4.874499,4.758836,0.725949,0.132334
6358,Yemen,2018,0.226961,4.880210,4.765439,0.725967,0.127744


If we want to do the same but keep only a subset of the columns, we can use `.loc` indexing:

In [70]:
df.loc[df[anxiety_column] > 4, ["Entity", "Year"]]

Unnamed: 0,Entity,Year
0,Afghanistan,1990
1,Afghanistan,1991
2,Afghanistan,1992
3,Afghanistan,1993
4,Afghanistan,1994
...,...,...
6355,Yemen,2015
6356,Yemen,2016
6357,Yemen,2017
6358,Yemen,2018


## Exercises

### Ex 1 - Descriptive statistics of the column Year

Analyze the distribution of the column Year. What are its range, mean and standard deviation?

In [71]:
df["Year"].describe()

count    6420.000000
mean     2004.500000
std         8.656116
min      1990.000000
25%      1997.000000
50%      2004.500000
75%      2012.000000
max      2019.000000
Name: Year, dtype: float64

Obtain the distribution of the column Year. 
Comment on the dataset balance, based on the number of samples per year.

In [72]:
df["Year"].value_counts()

Year
1990    214
1991    214
2018    214
2017    214
2016    214
2015    214
2014    214
2013    214
2012    214
2011    214
2010    214
2009    214
2008    214
2007    214
2006    214
2005    214
2004    214
2003    214
2002    214
2001    214
2000    214
1999    214
1998    214
1997    214
1996    214
1995    214
1994    214
1993    214
1992    214
2019    214
Name: count, dtype: int64
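
Because `value_counts` orders its result by count, ties can appear in a non-chronological order, as in the output above. One way to read the result by year and to check the balance programmatically is sketched below on made-up data:

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the dataset)
years = pd.Series([1991, 1990, 1991, 1990, 1992, 1992], name="Year")

per_year = years.value_counts().sort_index()  # order by year instead of by count
is_balanced = per_year.nunique() == 1         # True when every year has the same count

print(per_year)
print(is_balanced)
```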

### Ex 2 - Compute the correlation between two columns

Use the function `DataFrame.corr` to find the Pearson correlation between all pairs of the following columns:

* Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized
* Depressive disorders (share of population) - Sex: Both - Age: Age-standardized
* Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized
* Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized
* Eating disorders (share of population) - Sex: Both - Age: Age-standardized

Comment on the obtained results.

In [73]:
df["Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized"].corr(
                    df['Depressive disorders (share of population) - Sex: Both - Age: Age-standardized'])

-0.4749936266867936

In [74]:
import itertools

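# NOTE: after dropping "Code", the disorder columns start at position 2, so
# df.columns[3:] skips the Schizophrenia column and the output below covers only
# the other four disorders. Use df.columns[2:] to include all five.
subset_columns = df.columns[3:]
# Iterate over every unordered pair of columns
for col1, col2 in itertools.combinations(subset_columns, 2):
    correlation = df[col1].corr(df[col2])
    print(f"Correlations between columns:\n\t- {col1}\n\t- {col2}\n\t- Correlation: {correlation}")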
    print()

Correlations between columns:
	- Depressive disorders (share of population) - Sex: Both - Age: Age-standardized
	- Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized
	- Correlation: 0.1144288981493943

Correlations between columns:
	- Depressive disorders (share of population) - Sex: Both - Age: Age-standardized
	- Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized
	- Correlation: 0.15303927851123564

Correlations between columns:
	- Depressive disorders (share of population) - Sex: Both - Age: Age-standardized
	- Eating disorders (share of population) - Sex: Both - Age: Age-standardized
	- Correlation: -0.05206717118327932

Correlations between columns:
	- Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized
	- Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized
	- Correlation: 0.5762304503027424

Correlations between columns:
	- Anxiety disorders (share of population) - Sex: Both - Age: A
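
An alternative to looping over pairs is to call `DataFrame.corr` on the selected columns, which returns the whole correlation matrix at once. The sketch below uses randomly generated toy data with hypothetical column names (`a`, `b`, `c`), so its values say nothing about the dataset:

```python
import numpy as np
import pandas as pd

# Random toy data with hypothetical column names; values are not from the dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# Pearson correlation between every pair of the selected columns
corr_matrix = toy[["a", "b", "c"]].corr(method="pearson")

print(corr_matrix.round(3))

# Each off-diagonal entry matches the pairwise Series.corr result
pairwise = toy["a"].corr(toy["b"])
print(np.isclose(corr_matrix.loc["a", "b"], pairwise))
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information.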

### Ex 3 - Find the country with the highest rate of Anxiety disorder

Hint: The function `.max()` can be used to find the maximum value of a column. The function `.argmax()` gives you the integer position of that maximum value, which can be passed to `.iloc`.

In [75]:
imax = df["Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized"].argmax()

In [76]:
df.iloc[imax]

Entity                                                                                 Brazil
Year                                                                                     2006
Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized    0.275419
Depressive disorders (share of population) - Sex: Both - Age: Age-standardized       4.467471
Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized          8.624634
Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized          1.112257
Eating disorders (share of population) - Sex: Both - Age: Age-standardized           0.211118
Name: 856, dtype: object
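
Note that `argmax` returns an integer position, which pairs with `.iloc`, while `idxmax` returns an index label, which pairs with `.loc`. With the default `RangeIndex` used here the two coincide, but they differ once the index is no longer 0, 1, 2, and so on (for example after filtering rows). A sketch on made-up data with a custom index and a shortened, hypothetical column name:

```python
import pandas as pd

# Toy data, made up for illustration; the index labels are deliberately not 0, 1, 2
toy = pd.DataFrame(
    {"Entity": ["A", "B", "C"], "Anxiety": [3.0, 8.6, 4.1]},
    index=[10, 20, 30],
)

pos = toy["Anxiety"].argmax()    # integer position of the maximum
label = toy["Anxiety"].idxmax()  # index label of the maximum

entity_by_pos = toy.iloc[pos]["Entity"]
entity_by_label = toy.loc[label, "Entity"]

print(pos, label)
print(entity_by_pos, entity_by_label)
```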

### Ex 4 - Remove outliers

Removing outliers is a critical step when processing data.

1. Pick one of the quantitative columns that represent a share of the population.
2. Define an outlier as any value x that lies more than two standard deviations away from the column mean.
3. Remove all rows whose value in that column is an outlier under this definition.
4. Re-compute all the descriptive statistics over that column and compare the differences.

Hint: In indexing, you can combine more than one condition (e.g. `df[(df["column"] > a) & (df["column"] < b)]`).

In [77]:
# Your code goes here
column_name = "Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized"
df_column = df[column_name]
mean_column = df_column.mean()
std_column = df_column.std()
threshold_upper = mean_column + 2 * std_column
threshold_lower = mean_column - 2 * std_column

print(mean_column, std_column, threshold_upper, threshold_lower)

4.101839659922118 1.0505430995124705 6.202925858947059 2.000753460897177


In [78]:
df_no_outliers = df[ (df[column_name] >= threshold_lower) & (df[column_name] <= threshold_upper) ]

In [79]:
df_no_outliers[column_name]

0       4.713314
1       4.702100
2       4.683743
3       4.673549
4       4.670810
          ...   
6415    3.184012
6416    3.187148
6417    3.188418
6418    3.172111
6419    3.137017
Name: Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized, Length: 6059, dtype: float64

In [80]:
len(df), len(df_no_outliers)

(6420, 6059)
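
For step 4, the descriptive statistics before and after filtering can be placed side by side. The sketch below works on a made-up Series rather than the dataset column, and expresses the same two-standard-deviation rule through z-scores:

```python
import pandas as pd

# Toy data, made up for illustration; 9.0 is the intended outlier
values = pd.Series([4.0, 4.1, 3.9, 4.2, 3.8, 4.0, 4.1, 3.9, 4.0, 9.0], name="share")

# z-score: distance from the mean measured in standard deviations
z_scores = (values - values.mean()) / values.std()
filtered = values[z_scores.abs() <= 2]

# Side-by-side comparison of the descriptive statistics
comparison = pd.concat(
    [values.describe(), filtered.describe()],
    axis=1,
    keys=["before", "after"],
)
print(comparison)
```

In this toy example, the mean and standard deviation both drop after filtering, because the removed value was pulling them up.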