# Foundations of Data Science - CMU Portugal Academy

> 
> Instructors:
>   - David Semedo (df.semedo@fct.unl.pt)
>   - Rafael Ferreira (rah.ferreira@fct.unl.pt)
>

In [82]:
import numpy as np
import pandas as pd


## Reference dataset - Mental Illness ([Link](https://www.kaggle.com/datasets/imtkaggleteam/mental-health))

We will take the "Mental Health" dataset as reference, to introduce a set of Pandas operations.

**Motivation**:

* Mental health is an essential part of people’s lives and society. Poor mental health affects our well-being, our ability to work, and our relationships with friends, family, and community.

* Mental health conditions are not uncommon. Hundreds of millions suffer from them yearly, and many more do over their lifetimes. It’s estimated that 1 in 3 women and 1 in 5 men will experience major depression in their lives. Other conditions, such as schizophrenia and bipolar disorder, are less common but still have a large impact on people’s lives.


In [83]:
dataset_path = "datasets/1- mental-illnesses-prevalence.csv"
df = pd.read_csv(dataset_path)

In [None]:
df

### Obtain an overview and basic information about the DataFrame

In [None]:
df.info()

In [None]:
df.head()  # Inspect the first 5 rows. Pass a number to change the size, e.g. df.head(20)

In [None]:
df.columns

#### Single column Inspection 

In [None]:
df["Year"]

The result of the operation above is a Pandas Series:

In [None]:
type(df["Year"])

### Removing columns

We can remove columns with the `DataFrame.drop()` method ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)). There are multiple ways:
* By specifying the axis index (0 for rows, 1 for columns): `df.drop(<column name string>, axis=1)`
* By using the keyword argument `columns`: `df.drop(columns=<list of columns>)`


Keep in mind that, by default, `drop` is not an "inplace" operation: it returns a new DataFrame without the dropped columns and leaves the original DataFrame unchanged.

In [None]:
print(f"Num columns: {len(df.columns)}")
df.drop(columns=["Code"])  # Equivalent to df.drop("Code", axis=1); the returned DataFrame is discarded here
print(f"Num columns after drop: {len(df.columns)}")  # Unchanged: df itself was not modified


In [None]:
# Set the inplace argument to True, or assign the returned DataFrame to the original variable:
df.drop(columns=["Code"], inplace=True)
# or
#df = df.drop(columns=["Code"])
print(f"Num columns after drop: {len(df.columns)}")


### Quantitative Variables: Obtain descriptive statistics from a single column.

In [None]:
df["Year"].describe()

Find the maximum and minimum values of a given column:

In [None]:
df["Year"].max(), df["Year"].min()

### Categorical Values

For categorical values, we might want to know the domain size (i.e. how many unique values).

In [None]:
unique_entities = df["Entity"].unique() # Produces a NumPy array
unique_entities

In [None]:
len(unique_entities)

Another useful operation is to obtain the distribution of the different values, i.e. how many rows each value appears in:

In [None]:
df["Entity"].value_counts().iloc[:10]

The result of `value_counts()` is a Series sorted by count, so we can select entries by position with `.iloc`, using regular Python slicing:

In [None]:
df.columns

In [None]:
# Inspect the first 15 values
df["Entity"].value_counts().iloc[:15]

We can test a condition over each row:

In [65]:
anxiety_column = "Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized"

In [None]:
df[anxiety_column] 

In [None]:
df[anxiety_column] > 4

This condition can be used to index the DataFrame and obtain the rows for which the condition is True:

In [None]:
df[df[anxiety_column] > 4]

If we want to do the same but keep only a subset of the columns, we can use `.loc` indexing:

In [None]:
df.loc[df[anxiety_column] > 4, ["Entity", "Year"]]

## Exercises

### Ex 1 - Descriptive statistics of the column Year

Analyze the distribution of the column Year. What are its range, mean, and standard deviation?

In [70]:
# Your code goes here

Obtain the distribution of the column Year. 
Comment on the dataset balance, based on the number of samples per year.

In [71]:
# Your code goes here

### Ex 2 - Compute the correlation between two columns

Use the method `DataFrame.corr` to find the Pearson correlation between all pairs of the following columns:

* Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized
* Depressive disorders (share of population) - Sex: Both - Age: Age-standardized
* Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized
* Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized
* Eating disorders (share of population) - Sex: Both - Age: Age-standardized

Comment on the obtained results.
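
As a reminder of how `DataFrame.corr` behaves, here is a minimal sketch on a made-up DataFrame. The column names `a`, `b`, `c`, and `label` and all values are hypothetical and unrelated to the dataset. Selecting a list of columns first restricts the correlation matrix to those columns.

```python
import pandas as pd

# Toy data (hypothetical values, not from the mental health dataset).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],      # increases linearly with "a"
    "c": [4.0, 3.0, 2.0, 1.0],      # decreases linearly with "a"
    "label": ["w", "x", "y", "z"],  # non-numeric column, left out of the selection
})

# Select the numeric columns of interest, then compute pairwise Pearson correlations.
corr_matrix = toy[["a", "b", "c"]].corr(method="pearson")
print(corr_matrix)
```

The result is a square, symmetric DataFrame indexed by column name on both axes, so a single pair can be read with `corr_matrix.loc["a", "b"]`.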

In [72]:
# Your code goes here

### Ex 3 - Find the country with highest rate of Anxiety disorder

Hint: The method `.max()` returns the maximum value of a column. The method `.argmax()` returns the integer position of that maximum value.
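
The functions in the hint can be compared on a small, made-up Series. The values and labels below are illustrative only, and `.idxmax()` is an extra method not mentioned in the hint. A minimal sketch: `.max()` returns the largest value, `.argmax()` its integer position, and `.idxmax()` its index label.

```python
import pandas as pd

# Toy Series with a non-default index (hypothetical values).
scores = pd.Series([3.2, 7.5, 5.1], index=["a", "b", "c"])

largest_value = scores.max()  # the maximum value itself
position = scores.argmax()    # integer position of the maximum (0-based)
label = scores.idxmax()       # index label of the maximum

print(largest_value, position, label)

# A position is used with .iloc, a label is used with .loc.
print(scores.iloc[position], scores.loc[label])
```

With the default integer index, position and label coincide, but they can differ once rows have been filtered or reordered.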

In [74]:
# Your code goes here

### Ex 4 - Remove outliers

Removing outliers is a critical step when processing data.

1. Pick one of the quantitative columns that represent a share of the population.
2. Define an outlier as any value x that is more than two standard deviations away from the column's mean.
3. Remove all rows whose value in that column is an outlier according to this definition.
4. Re-compute all the descriptive statistics over that column and compare the differences.

Hint: In indexing, you can combine more than one condition (e.g. `df[(df["column"] > a) & (df["column"] < b)]`).
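
The pattern in the hint can be sketched on a toy DataFrame. The column name, the values, and the bounds `a` and `b` are all hypothetical placeholders.

```python
import pandas as pd

# Toy DataFrame (hypothetical values, not from the dataset).
toy = pd.DataFrame({"column": [1, 5, 9, 12, 20]})

a, b = 4, 15  # illustrative lower and upper bounds

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than > and <.
mask = (toy["column"] > a) & (toy["column"] < b)
inside = toy[mask]

print(inside["column"].tolist())
```

Use `|` for an element-wise "or" and `~` to negate a mask. Python's `and`, `or`, and `not` keywords do not work on a Series.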

In [75]:
# Your code goes here