# 0101 - First Session With Python - Solution Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

## About 

### Using Jupyter

You have several options: 
- Locally: 

    - **Install Anaconda https://www.anaconda.com/ or Jupyter https://jupyter.org/install on your machine**

    - Use Anaconda or Jupyter installed on the Unilasalle PC (**Warning ⚠️**: some packages may be missing) 


- Online:

    - **Use Google Colab https://colab.research.google.com/** (you must be signed in to your Google account)

    - **Open this notebook on Google colab URL**
        * Badge

    - Use Jupyter online  https://jupyter.org/try-jupyter (**Warning ⚠️**: External packages cannot be installed) 


### Material

All the material for this course can be found here:
- https://github.com/AlexandreGazagnes/Unilassalle-Public-Ressources/tree/main/4a-data-analysis

### Python / Jupyter ? 

A few questions: 
- Why Python ?
- Python vs R ? 
- What is Data Analysis ? 
- What are we talking about ? 
- What is Jupyter ?

### Context

You are a new employee of the NPO named "NPO".

You are in charge of data analysis.

Your first project is about GHG emissions, more specifically those related to Bovine Meat.

### Data

After a quick look on the internet, you find a very interesting dataset on the FAO website. It contains a list of various indicators. You decide to use this dataset to identify segments of countries.

- Find relevant data : 
    - https://www.kaggle.com/datasets/unitednations/global-food-agriculture-statistics
    - https://www.kaggle.com/datasets/dorbicycle/world-foodfeed-production
    - https://www.fao.org/faostat/en/
    - https://fr-en.openfoodfacts.org/
    - https://fr-en.openfoodfacts.org/data


**You can use a preprocessed version of the dataset [here](https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv).** (Best option)



### Mission

Our job is to : 
* Prepare notebook environment
* Load data
* Explore data
* Clean data ==> Select relevant data
* Clean data ==> Handle missing values
* Clean data ==> Handle duplicates ? 
* Clean data ==> Handle outliers ?
* Perform some basic analysis and data inspection
* Perform some basic visualisation
* Export our data

### Useful Resources on PCA

- About PCA
    - https://www.youtube.com/watch?v=HMOI_lkzW08
    - https://www.youtube.com/watch?v=FgakZw6K1QQ
    - https://www.youtube.com/watch?v=0Jp4gsfOLMs&list=PLblh5JKOoLUJJpBNfk8_YadPwDTO2SCbx
    - https://www.youtube.com/watch?v=oRvgq966yZg
    - https://www.youtube.com/watch?v=FgakZw6K1QQ&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR
    - https://www.youtube.com/watch?v=_UVHneBUBW0
    - https://www.youtube.com/watch?v=KrNbyM925wI&list=PLnZgp6epRBbRn3FeMdaQgVsFh9Kl0fjqX
    - https://www.youtube.com/watch?v=2UFiMvXvdZ4
    - THE BEST ONE  : https://www.youtube.com/watch?v=VdpNEjStT5g


### Teacher 

- More info : 
    - https://www.linkedin.com/in/alexandregazagnes/
    - https://github.com/AlexandreGazagnes
    

## Preliminaries

### System

In [1]:
# pwd

In [2]:
# cd ..

In [3]:
# ls

In [4]:
# cd ..

In [None]:
# ls

In [None]:
# !pip install -r requirements.txt

In [None]:
# !pip install pandas matplotlib seaborn plotly scikit-learn

In [None]:
# If you want to download the data from the web, please uncomment the following lines

# !wget https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv

### Imports

In [None]:
# Imports

import numpy as np
import pandas as pd


import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# from sklearn.datasets import load_iris

### Data

In [None]:
# url
url = "https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv"
url

In [None]:
# Read data
df = pd.read_csv(url, encoding='latin1')
df

In [None]:
# or

# data = load_iris()
# df = pd.DataFrame(data.data, columns=data.feature_names)
# df["Species"] = data.target
# df.head()

In [None]:
# or 

# fn = "./data/source/FAO.csv"
# df = pd.read_csv(fn, encoding='latin1') 
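
As a side note on the `encoding` argument used above, here is a minimal sketch with a made-up two-row CSV held in memory (an assumption for illustration, not the real FAO file). The bytes are written as Latin-1, and reading them back with the matching encoding restores the accented characters.

```python
import io

import pandas as pd

# Hypothetical two-row CSV, encoded as Latin-1 (not the real FAO file)
raw = "Area,Y2010\nCôte d'Ivoire,12\nFrance,5\n".encode("latin1")

# Reading with the matching encoding decodes the accented characters correctly
toy = pd.read_csv(io.BytesIO(raw), encoding="latin1")
print(toy)
```

Reading the same bytes with the default UTF-8 decoder would typically fail with a `UnicodeDecodeError`, which is probably why `encoding='latin1'` is passed in this notebook.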

## Data Exploration

### Display

In [None]:
# head

df.head()

In [None]:
# tail
df.tail(10)

In [None]:
# sample 10
df.sample(10)

In [None]:
# sample frac
df.sample(frac=0.1)

### Structure

In [None]:
# shape
df.shape

In [None]:
# dtypes
df.dtypes

In [None]:
# count?
df.dtypes.value_counts()

In [None]:
# select ?
df.select_dtypes(include='object').head()

In [None]:
# nunique for object (text) columns ? 

df.select_dtypes(include="object").nunique()

In [None]:
# nunique float? 

# df.select_dtypes(include=float).nunique()
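
To make the structure checks above more concrete, here is a small sketch on a made-up DataFrame (the values are assumptions, only the column names echo the dataset). It splits the columns into numeric and non-numeric ones with `select_dtypes`, then counts distinct values per column with `nunique`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "France"],
        "Unit": ["1000 tonnes", "1000 tonnes", "1000 tonnes"],
        "Y2010": [1.5, 2.0, 1.5],
        "Area Code": [68, 203, 68],
    }
)

# Numeric columns (ints and floats)
num_cols = toy.select_dtypes(include="number")

# Everything that is not numeric (here: text columns)
text_cols = toy.select_dtypes(exclude="number")

# Number of distinct values per column
print(text_cols.nunique())
print(num_cols.nunique())
```

A column with a single distinct value (here `Unit`) carries no information for the analysis and is a good candidate for dropping.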

### Select data

In [None]:
# columns ? 
df.columns

In [None]:
columns = ['Area Abbreviation', 'Area Code', 'Area', 'Item Code', 'Item', 'Element Code', 'Element', 'Unit', 'latitude', 'longitude', 'Y2010', 'Y2011', 'Y2012', 'Y2013']
columns

In [None]:
# loc ? => JUST THE OUTPUT
df.loc[ :, columns].head()

In [None]:
# loc ? => REWRITE the DF
df = df.loc[:, columns]
df.sample(10)

In [None]:
# iloc ? 

df.iloc[:3, :3]
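
The difference between `loc` (labels) and `iloc` (positions) is easier to see when the index is not 0, 1, 2. A minimal sketch on a made-up DataFrame (the data and index labels are assumptions):

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {"Area": ["France", "Spain", "Italy"], "Y2010": [10, 20, 30], "Y2011": [11, 21, 31]},
    index=[100, 200, 300],
)

# loc selects by label: rows labelled 100 and 200, columns by name
by_label = toy.loc[[100, 200], ["Area", "Y2010"]]

# iloc selects by position: first two rows, first two columns
by_position = toy.iloc[:2, :2]

# Both selections return the same data here
print(by_label.equals(by_position))

# Label slices include the end label, position slices exclude the end position
print(len(toy.loc[100:200]))  # 2 rows
print(len(toy.iloc[0:1]))     # 1 row
```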

In [None]:
# head
df.head()

In [None]:
# columns ?
df.columns

In [None]:
# Creating a list of the columns containing "Code"

columns = ["Area Code", "Item Code", "Element Code"]
columns

In [None]:
# Same but better  ! 
columns = []
for col in df.columns:
    if "Code" in col : 
        columns.append(col)

In [None]:
# Output columns
columns
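
The loop above can also be written as a list comprehension. A short sketch, using a plain list of column names copied from the selection made earlier instead of `df.columns` so that it runs on its own:

```python
# Column names taken from the notebook's selection (plain list, no DataFrame needed)
all_columns = [
    "Area Abbreviation", "Area Code", "Area", "Item Code",
    "Item", "Element Code", "Element", "Unit",
]

# Keep only the names that contain the substring "Code"
code_columns = [col for col in all_columns if "Code" in col]
print(code_columns)  # ['Area Code', 'Item Code', 'Element Code']
```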

In [None]:
# If needed : 
column_list = ["Area Code", "Item Code", "Element Code"]
column_list

In [None]:
# Drop columns
df.drop(columns=columns).head()

In [None]:
df

In [None]:
# Drop rows by index label
df.drop(index=[0,1,2]).head()

In [None]:
# Drop with errors="ignore"

df = df.drop(columns=columns, errors="ignore")
df.head()

In [None]:
# Implementing iloc

df.iloc[:, 1:].head()

In [None]:
# Saving our df 

df = df.iloc[:, 1:]
df.head()

In [None]:
# Just a specific column
df.Item.head()

In [None]:
# Just a specific column
df.loc[:, "Item"].head()

In [None]:
# Item unique ?
df.Item.sort_values().unique()

In [None]:
# Meat in Item unique ?
"Meat" in df.Item.unique()

In [None]:
# Select meat items
meat_items = []

for item in df.Item.unique():
    if "Meat" in item:
        meat_items.append(item)

meat_items
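
The same search can be done without an explicit loop, using the vectorized string method `str.contains`. A minimal sketch on a made-up list of items (apart from "Bovine Meat", the item names are assumptions):

```python
import pandas as pd

# Hypothetical item column, for illustration only
items = pd.Series(
    ["Bovine Meat", "Wheat and products", "Poultry Meat", "Meat, Other", "Bovine Meat"]
)

# Exact membership test: False, because no item is exactly "Meat"
print("Meat" in items.tolist())

# Substring test on the unique values
unique_items = pd.Series(items.unique())
meat_items = unique_items[unique_items.str.contains("Meat", regex=False)].tolist()
print(meat_items)  # ['Bovine Meat', 'Poultry Meat', 'Meat, Other']
```

This also shows why the earlier `"Meat" in df.Item.unique()` check is not a substring search: `in` compares whole values.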

In [None]:
# Creating a selector True / False
selector = (df.Item == "Bovine Meat").tolist()
selector[:10]

In [None]:
# More advanced selection
df.loc[selector, :  ].head()

In [None]:
# More advanced selection
df = df.loc[df.Item == "Bovine Meat"]
df.head()
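
Boolean selectors can be combined. The sketch below, on a made-up DataFrame (values are assumptions), keeps the rows that satisfy two conditions at once. Each condition needs its own parentheses, and `&` is used instead of `and`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {
        "Area": ["France", "France", "Spain", "Spain"],
        "Item": ["Bovine Meat", "Wheat", "Bovine Meat", "Wheat"],
        "Y2010": [10.0, 50.0, 3.0, 40.0],
    }
)

# One True / False value per row
mask = (toy.Item == "Bovine Meat") & (toy.Y2010 > 5)

# Keep the matching rows; .copy() gives an independent DataFrame
selected = toy.loc[mask, :].copy()
print(selected)
```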

In [None]:
# Area?
df.Area.unique()[:10]

In [None]:
# Area nunique ?
df.Area.nunique()

In [None]:
# Item nunique ?
df.Item.nunique()

In [None]:
# Unit nunique ?
df.Unit.nunique()

In [None]:
# Drop other useless columns

columns = ["Item", "Element", "Unit", "latitude", "longitude"]

df = df.drop(columns=columns, errors="ignore")
df

### NaN

In [None]:
# NaN values
df.isna().head()

In [None]:
# Sum of NaN values
df.isna().sum()

In [None]:
# Select NaN values
df.loc[df.Y2010.isna(), :]

In [None]:
# Other selection
df.loc[df.Area =="Sudan", :]

In [None]:
# Drop a specific row
df.loc[df.Area != "Sudan", :].head()

In [None]:
# Drop a specific row
df = df.loc[df.Area != "Sudan", :]

df.head()

In [None]:
# Are we done ?
df.isna().sum()

In [None]:
# Useless but fun
df.isna().sum().sum()

In [None]:
# Output df
df
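
Filtering on `Area != "Sudan"` works when the incomplete row is known by name. A more general alternative is `dropna`, sketched here on a made-up DataFrame (the values are assumptions, and the missing row is only modelled on what the cells above suggest):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one incomplete row
toy = pd.DataFrame(
    {
        "Area": ["France", "Sudan", "Spain"],
        "Y2010": [10.0, np.nan, 3.0],
        "Y2011": [11.0, np.nan, 4.0],
    }
)

# Count the missing values per column
n_missing = toy.isna().sum()
print(n_missing)

# Drop every row with a missing value in the listed columns
cleaned = toy.dropna(subset=["Y2010", "Y2011"])
print(cleaned)
```

Whether dropping rows is appropriate depends on the analysis; it is reasonable here only because very few rows are affected.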

### Data Inspection

In [None]:
# Describe
df.describe()

In [None]:
# Better describe ?
df.describe().round(2)

In [None]:
# Recast as int
df.describe().astype(int)

In [None]:
# Sort by values
df.sort_values(by="Y2010").head(20)

In [None]:
# Select small values
df.loc[df.Y2010 < 5, :]

In [None]:
# Select small values and sort 
df.loc[df.Y2010 < 5, :].sort_values(by="Y2010")

In [None]:
# Select 'big' values ==> drop lower values (rows with Y2010 exactly equal to 5 are dropped as well)
df  = df.loc[df.Y2010 > 5, :]
df.head()

In [None]:
# sort by values top : 
df.sort_values(by="Y2010", ascending=False).head(20)

In [None]:
# Are we good ? 
df.sort_values(by="Y2010", ascending=True).head(20)

In [None]:
# Just to be sure : 
df.select_dtypes(include="number").head()

In [None]:
# Creating tmp variable, just with numeric values

tmp = df.select_dtypes(include="number")

In [None]:
# A correlation matrix makes little sense here
# (sorry for that 😅)

corr = tmp.corr()
corr.round(4)

In [None]:
# Heatmap ? 
sns.heatmap(corr, annot=True)

In [None]:
# Better heatmap ?
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt='.4f', vmin=0, vmax=1)

In [None]:
# Best heatmap ever done ?
mask = np.triu(np.ones_like(corr, dtype=bool))  # True = hidden cell (upper triangle and diagonal)
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".4f", vmin=-1, vmax=1, mask=mask)

In [None]:
# Build your first function

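def corr_heatmap(df):
    """Plot the lower triangle of the correlation matrix of the numeric columns."""
    tmp = df.select_dtypes(include="number")
    corr = tmp.corr()
    # Boolean mask: True hides a cell (upper triangle and diagonal)
    mask = np.triu(np.ones_like(corr, dtype=bool))
    sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".4f", vmin=-1, vmax=1, mask=mask)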

In [None]:
# Use this function
corr_heatmap(df)
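
To see what the heatmap mask does without drawing anything, here is a small sketch on made-up columns `a`, `b`, `c` (assumptions for illustration). It builds an explicit boolean upper-triangle mask and applies it to the correlation matrix: the cells where the mask is `True` are the ones the heatmap would hide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is a multiple of a, c decreases when a increases
toy = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0], "c": [4.0, 3.0, 2.0, 1.0]}
)
corr = toy.corr()

# True on the diagonal and above, False below
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# Replace the masked cells with NaN to show what remains visible
visible = corr.mask(mask)
print(visible.round(2))
```

A correlation matrix is symmetric, so the lower triangle alone already contains every pair once.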

### Visualisation

In [None]:
# Just to be sure
df.sort_values("Y2010", ascending=False).head(20)

In [None]:
# Just to be sure
df.sort_values("Y2010", ascending=False).tail(20)

In [None]:
# Displot
sns.displot(df.Y2010, kde=True)

In [None]:
# Displot of a normal distribution
sns.displot(np.random.normal(size=10000), kde=True, bins=100)

In [None]:
# What about skewness ?
df.Y2010.skew()

In [None]:
# What about kurtosis ?
df.Y2010.kurtosis()

In [None]:
# Log1p ? 
log_Y2010 = np.log1p(df.Y2010)
sns.displot(log_Y2010, kde=True)
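
The effect of `np.log1p` on a skewed distribution can be checked numerically. A minimal sketch on made-up values that double at each step (an assumption for illustration, not the FAO data):

```python
import numpy as np
import pandas as pd

# Hypothetical, strongly right-skewed values
values = pd.Series([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float)

# log1p(x) = log(1 + x), defined for x = 0 as well
logged = np.log1p(values)

# Skewness before and after the transformation
print(round(values.skew(), 2))
print(round(logged.skew(), 2))

# expm1 is the inverse transformation: expm1(log1p(x)) = x
restored = np.expm1(logged)
print(np.allclose(restored, values))
```

The transformed series is much closer to symmetric, which is why the log scale is often easier to read on a histogram or boxplot.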

In [None]:
# Top 5
top_5  = df.sort_values("Y2010", ascending=False).head(5)
top_5

In [None]:
# Bar plot 
sns.barplot(data=top_5, x="Area", y="Y2010")

In [None]:
# Same but better
px.bar(data_frame=top_5, x="Area", y="Y2010")

In [None]:
# My favorite plot
sns.boxplot(data=df.Y2010)

In [None]:
# Ok, this one
sns.boxplot(data=np.log1p(df.Y2010))

In [None]:
# Just another df output
df

In [None]:
# Melt ?

melt = pd.melt(df, id_vars=["Area"], value_vars=["Y2010", "Y2011", "Y2012", "Y2013"])
melt
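
What `pd.melt` does is easier to follow on a tiny table. The sketch below uses a made-up two-country DataFrame (values are assumptions) and names the new column `year` through `var_name`, which is a choice made for this illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per area, one column per year
wide = pd.DataFrame({"Area": ["France", "Spain"], "Y2010": [10, 3], "Y2011": [11, 4]})

# Long table: one row per (area, year) pair
long_df = pd.melt(wide, id_vars=["Area"], value_vars=["Y2010", "Y2011"], var_name="year")

# Turn "Y2010" into the integer 2010
long_df["year"] = long_df["year"].str.lstrip("Y").astype(int)
print(long_df)
```

The long format is what `sns.boxplot` and `px.line` expect when one axis is the year and the colour is the area.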

In [None]:
# Boxplot
sns.boxplot(data=melt, x="variable", y="value")

In [None]:
# Line plot 
px.line(data_frame=melt, x="variable", y="value", color="Area")

In [None]:
# Melt 

melt = pd.melt(top_5, id_vars=["Area"], value_vars=["Y2010", "Y2011", "Y2012", "Y2013"])
px.line(data_frame=melt, x="variable", y="value", color="Area")

## Export

In [None]:
# Export CSV
df.to_csv("data.csv", index=False)
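
The `index=False` argument matters after filtering, because the remaining index labels are no longer 0, 1, 2. A minimal sketch on a made-up DataFrame (data and index labels are assumptions), writing to a string instead of a file:

```python
import io

import pandas as pd

# Hypothetical filtered data: the index labels are leftovers from earlier rows
toy = pd.DataFrame({"Area": ["France", "Spain"], "Y2010": [10.0, 3.0]}, index=[7, 42])

# Without a path, to_csv returns the CSV text
header_with_index = toy.to_csv().splitlines()[0]
header_without_index = toy.to_csv(index=False).splitlines()[0]
print(header_with_index)     # ,Area,Y2010
print(header_without_index)  # Area,Y2010

# Round trip: the reloaded table gets a fresh 0..n-1 index
reloaded = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(reloaded)
```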