# 0101 - First Session With Python - Training Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

## About 

### Using Jupyter

You have several options:
- Locally: 

    - **Install Anaconda https://www.anaconda.com/ or Jupyter https://jupyter.org/install on your machine**

    - Use Anaconda or Jupyter installed on the Unilasalle PC (**Warning ⚠️**: some packages may be missing) 


- Online:

    - **Use Google Colab https://colab.research.google.com/** (you have to be connected to your google account)

    - **Open this notebook on Google Colab**: https://github.com/AlexandreGazagnes/Unilassalle-Public-Ressources/blob/main/4a-data-analysis/01-session/0101-training-notebook.ipynb
        * Badge : <a target="_blank" href="https://colab.research.google.com/github/AlexandreGazagnes/Unilassalle-Public-Ressources/blob/main/4a-data-analysis/01-session/0101-training-notebook.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

    - Use Jupyter online  https://jupyter.org/try-jupyter (**Warning ⚠️**: External packages cannot be installed) 


### Material

All the material for this course can be found here:
- https://github.com/AlexandreGazagnes/Unilassalle-Public-Ressources/tree/main/4a-data-analysis

### Python / Jupyter ? 

A few questions:
- Why Python?
- Python vs R?
- What is data analysis?
- What are we talking about?
- What is Jupyter?

### Context

You are a new employee of the NPO named "NPO".

You are in charge of data analysis.

Your first project is about greenhouse gas (GHG) emissions, more precisely those related to bovine meat.

### Data

After a quick look on the internet, you find a very interesting dataset on the FAO website. It contains a list of various indicators. You decide to use this dataset to identify segments of countries.

- Find relevant data : 
    - https://www.kaggle.com/datasets/unitednations/global-food-agriculture-statistics
    - https://www.kaggle.com/datasets/dorbicycle/world-foodfeed-production
    - https://www.fao.org/faostat/en/
    - https://fr-en.openfoodfacts.org/
    - https://fr-en.openfoodfacts.org/data


**You can use a preprocessed version of the dataset [here](https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv).** (Best option)



### Mission

Our job is to : 
* Prepare notebook environment
* Load data
* Explore data
* Clean data ==> Select relevant data
* Clean data ==> Handle missing values
* Clean data ==> Handle duplicates ? 
* Clean data ==> Handle outliers ?
* Perform some basic analysis and data inspection
* Perform some basic visualisation
* Export our data

### Useful Resources on PCA

- About PCA
    - https://www.youtube.com/watch?v=HMOI_lkzW08
    - https://www.youtube.com/watch?v=FgakZw6K1QQ
    - https://www.youtube.com/watch?v=0Jp4gsfOLMs&list=PLblh5JKOoLUJJpBNfk8_YadPwDTO2SCbx
    - https://www.youtube.com/watch?v=oRvgq966yZg
    - https://www.youtube.com/watch?v=FgakZw6K1QQ&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR
    - https://www.youtube.com/watch?v=_UVHneBUBW0
    - https://www.youtube.com/watch?v=KrNbyM925wI&list=PLnZgp6epRBbRn3FeMdaQgVsFh9Kl0fjqX
    - https://www.youtube.com/watch?v=2UFiMvXvdZ4
    - THE BEST ONE  : https://www.youtube.com/watch?v=VdpNEjStT5g


### Teacher 

- More info : 
    - https://www.linkedin.com/in/alexandregazagnes/
    - https://github.com/AlexandreGazagnes
    

## Preliminaries

### System

In [None]:
# pwd

In [None]:
# cd ..

In [None]:
# ls

In [None]:
# cd ..

In [None]:
# ls

In [None]:
# !pip install -r requirements.txt

In [None]:
# !pip install pandas matplotlib seaborn plotly scikit-learn

In [None]:
# If you want to download the data from the web, uncomment the following line

# !wget https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv

### Imports

In [None]:
# Imports

import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# from sklearn.datasets import load_iris

### Data

In [None]:
# url
url = "https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv"
url

In [None]:
# Read data
df = pd.read_csv(url, encoding="latin1")
df

In [None]:
# or

# data = load_iris()
# df = pd.DataFrame(data.data, columns=data.feature_names)
# df["Species"] = data.target
# df.head()

In [None]:
# or

# fn = "./data/source/FAO.csv"
# df = pd.read_csv(fn, encoding='latin1')
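
The `encoding="latin1"` argument used above matters when a file contains accented characters stored as Latin-1 bytes. The sketch below illustrates this on a tiny in-memory CSV. The row content is invented for the example, and whether the FAO file contains such characters is an assumption.

```python
import io

import pandas as pd

# Invented one-row CSV, encoded as Latin-1 bytes (not taken from the FAO file).
raw_bytes = "Area,Item,Y2013\nCôte d'Ivoire,Bovine Meat,45\n".encode("latin1")

# Telling pandas the encoding lets it decode the accented character correctly.
df_demo = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")
print(df_demo)
```

Reading the same bytes with the default UTF-8 encoding would raise a decoding error, which is usually the sign that an explicit `encoding` is needed.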

## Data Exploration

### Display

In [None]:
# head

In [None]:
# tail

In [None]:
# sample 10

In [None]:
# sample frac
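
One possible way to fill in the display cells above, shown on a small invented DataFrame (the column names mirror the FAO file, the values are made up). This is a sketch, not the expected answer; on the real `df` the same calls apply.

```python
import pandas as pd

# Invented toy data: column names mirror the FAO file, values are made up.
toy = pd.DataFrame(
    {
        "Area": ["France", "Brazil", "India", "Kenya", "Japan", "Chile"],
        "Item": ["Bovine Meat"] * 6,
        "Y2013": [1500, 9600, 1100, 450, 500, 210],
    }
)

first_rows = toy.head(3)  # first 3 rows
last_rows = toy.tail(2)  # last 2 rows

# random_state makes the random draw reproducible from one run to the next.
random_rows = toy.sample(4, random_state=0)  # 4 random rows
random_half = toy.sample(frac=0.5, random_state=0)  # 50 % of the rows
```

On the full dataset, `df.sample(10)` plays the same role as `toy.sample(4)` here.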

### Structure

In [None]:
# shape

In [None]:
# dtypes

In [None]:
# count?

In [None]:
# select ?

In [None]:
# nunique int ?

In [None]:
# nunique float?
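
The structure checks above could look like the following sketch. The toy DataFrame and its values are invented for illustration; only the pandas calls are the point.

```python
import pandas as pd

# Invented toy data with one missing value and one repeated row.
toy = pd.DataFrame(
    {
        "Area": ["France", "Brazil", "India", "Brazil"],
        "Area Code": [68, 21, 100, 21],
        "Y2013": [1500.0, 9600.0, None, 9600.0],
    }
)

n_rows, n_cols = toy.shape  # (number of rows, number of columns)
types = toy.dtypes  # one dtype per column
non_null = toy.count()  # non-missing values per column

# Keep only the columns of a given dtype, then count distinct values.
int_cols = toy.select_dtypes(include="int64")
float_cols = toy.select_dtypes(include="float64")
unique_int = int_cols.nunique()
unique_float = float_cols.nunique()  # missing values are not counted
```

A column with very few distinct values relative to the number of rows is often a category or a code rather than a measurement.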

### Select data

In [None]:
# columns ?

In [None]:
columns = [
    "Area Abbreviation",
    "Area Code",
    "Area",
    "Item Code",
    "Item",
    "Element Code",
    "Element",
    "Unit",
    "latitude",
    "longitude",
    "Y2010",
    "Y2011",
    "Y2012",
    "Y2013",
]
columns

In [None]:
# loc ? => JUST THE OUTPUT

In [None]:
# loc ? => REWRITE the DF

In [None]:
# iloc ?

In [None]:
# head

In [None]:
# columns ?

In [None]:
# Create a list of the code columns

columns = ["Area Code", "Item Code", "Element Code"]
columns

In [None]:
# Same but better  !

In [None]:
# Output columns

In [None]:
# If needed :
column_list = ["Area Code", "Item Code", "Element Code"]
column_list

In [None]:
# Drop columns

In [None]:
# drop columns

In [None]:
# Drop with errors="ignore"

In [None]:
# Implementing iloc

In [None]:
# Saving our df

In [None]:
# Just a specific column

In [None]:
# Just a specific column

In [None]:
# Item unique ?

In [None]:
# Meat in Item unique ?

In [None]:
# Select meat items

In [None]:
# Creating a selector True / False

In [None]:
# More advanced selection

In [None]:
# More advanced selection

In [None]:
# Area?

In [None]:
# Area nunique ?

In [None]:
# Item nunique ?

In [None]:
# Unit unique ?

In [None]:
# Drop other useless columns

columns = [
    "Item",
    "Element",
    "Unit",
    "latitude",
    "longitude",
]
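
A minimal sketch of the selection steps above (`loc`, `iloc`, `drop`, boolean selectors), assuming a toy DataFrame whose values are invented. Variable names such as `code_columns` and `is_meat` are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

# Invented toy data: column names mirror the FAO file, values are made up.
toy = pd.DataFrame(
    {
        "Area": ["France", "France", "Brazil", "Brazil"],
        "Area Code": [68, 68, 21, 21],
        "Item": ["Bovine Meat", "Wheat and products", "Bovine Meat", "Pigmeat"],
        "Element": ["Food", "Food", "Food", "Feed"],
        "Y2013": [1500, 18000, 9600, 30],
    }
)

# loc selects by label: all rows, a list of columns.
subset = toy.loc[:, ["Area", "Item", "Y2013"]]

# iloc selects by integer position: first 2 rows, first 3 columns.
corner = toy.iloc[:2, :3]

# Build the list of "code" columns programmatically instead of typing it.
code_columns = [col for col in toy.columns if "Code" in col]

# drop returns a new DataFrame; errors="ignore" makes re-running the cell safe.
no_codes = toy.drop(columns=code_columns, errors="ignore")

# Boolean selector: True where Item mentions meat (case-insensitive).
is_meat = toy["Item"].str.contains("meat", case=False)
meat = toy.loc[is_meat]

# More advanced selection: combine conditions with & and parentheses.
bovine_food = toy.loc[(toy["Item"] == "Bovine Meat") & (toy["Element"] == "Food")]
```

Note that `toy.loc[...]` and `toy.drop(...)` only return a result. To keep it, assign it back, for example `toy = toy.drop(columns=code_columns)`.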

### NaN

In [None]:
# NaN values

In [None]:
# Sum of NaN values

In [None]:
# Select NaN values

In [None]:
# Other selection

In [None]:
# Drop a specific row

In [None]:
# Drop a specific row

In [None]:
# Are we done ?

In [None]:
# Useless but fun

In [None]:
# Output df
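
The missing-value steps above can be sketched as follows, on an invented toy DataFrame. Dropping rows is only one strategy among others; it is used here because the cells above mention it.

```python
import numpy as np
import pandas as pd

# Invented toy data with two missing values.
toy = pd.DataFrame(
    {
        "Area": ["France", "Brazil", "India", "Kenya"],
        "Y2012": [1480.0, np.nan, 1050.0, 430.0],
        "Y2013": [1500.0, 9600.0, np.nan, 450.0],
    }
)

nan_mask = toy.isna()  # True where a value is missing
nan_per_column = nan_mask.sum()  # number of missing values per column

# Rows that contain at least one missing value.
rows_with_nan = toy.loc[nan_mask.any(axis=1)]

# Drop one specific row by its index label.
without_row_1 = toy.drop(index=1)

# Drop every row that contains at least one missing value.
clean = toy.dropna()
```

Calling `clean.isna().sum()` afterwards is a quick way to answer the "Are we done?" question.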

### Data Inspection

In [None]:
# Describe

In [None]:
# Better describe ?

In [None]:
# Recast as int

In [None]:
# Sort by values

In [None]:
# Select small values

In [None]:
# Select small values and sort

In [None]:
# select 'big' values ==> drop lower values

In [None]:
# sort by values top :

In [None]:
# Are we good ?

In [None]:
# Just to be sure :

In [None]:
# Creating tmp variable, just with numeric values

In [None]:
# A correlation matrix makes no sense here
# (sorry for that 😅)

In [None]:
# Heatmap ?

In [None]:
# Better heatmap ?

In [None]:
# Best heatmap ever done ?

In [None]:
# Build your first function


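def corr_heatmap(df):
    """Plot the lower triangle of the correlation matrix of the numeric columns."""
    tmp = df.select_dtypes(include="number")
    corr = tmp.corr()
    # Boolean mask hiding the upper triangle (diagonal included),
    # independent of the correlation values themselves
    mask = np.triu(np.ones_like(corr, dtype=bool))
    sns.heatmap(
        corr, annot=True, cmap="coolwarm", fmt=".4f", vmin=-1, vmax=1, mask=mask
    )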

In [None]:
# Use this function
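
A possible sketch of the inspection steps above (describe, recast, sort, filter), without any plotting. The toy values and the `threshold` of 100 are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Invented toy data: values are made up.
toy = pd.DataFrame(
    {
        "Area": ["France", "Brazil", "India", "Kenya", "Malta"],
        "Y2012": [1480.0, 9300.0, 1050.0, 430.0, 2.0],
        "Y2013": [1500.0, 9600.0, 1100.0, 450.0, 3.0],
    }
)

# Summary statistics of the numeric columns, rounded for readability.
summary = toy.describe().round(2)

# Recast the year columns as integers (only safe once NaN values are gone).
toy = toy.astype({"Y2012": "int64", "Y2013": "int64"})

# Sort from the largest to the smallest value.
ranked = toy.sort_values("Y2013", ascending=False)

# Arbitrary threshold separating "small" and "big" values.
threshold = 100
small = toy.loc[toy["Y2013"] < threshold].sort_values("Y2013")
big = toy.loc[toy["Y2013"] >= threshold]

# Temporary variable with numeric columns only, then its correlation matrix.
numeric = big.select_dtypes(include="number")
corr = numeric.corr()
```

The `corr` DataFrame computed here is the kind of object that `sns.heatmap` takes as input.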

### Visualisation

In [None]:
# Just to be sure

In [None]:
# Just to be sure

In [None]:
# Distplot

In [None]:
# Distplot normal

In [None]:
# What about skewness ?

In [None]:
# What about kurtosis ?

In [None]:
# Log1p ?

In [None]:
# Top 5

In [None]:
# Bar plot

In [None]:
# Same but better

In [None]:
# My favorite plot

In [None]:
# Ok, this one

In [None]:
# Just another df output

In [None]:
# Melt ?

In [None]:
# Boxplot

In [None]:
# Line plot

In [None]:
# Melt
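
Before plotting, the numeric side of the cells above (skewness, kurtosis, `log1p`, top 5, melt) can be sketched as below. The data are invented, and the column names `Year` and `Value` are hypothetical choices for the melted table.

```python
import numpy as np
import pandas as pd

# Invented toy data: one very large value makes the distribution right-skewed.
toy = pd.DataFrame(
    {
        "Area": ["France", "Brazil", "India", "Kenya", "Malta", "Chile"],
        "Y2012": [1480, 9300, 1050, 430, 2, 200],
        "Y2013": [1500, 9600, 1100, 450, 3, 210],
    }
)

skewness = toy["Y2013"].skew()  # > 0 means a long tail on the right
kurt = toy["Y2013"].kurt()  # excess kurtosis (0 for a normal distribution)

# log1p computes log(1 + x): it compresses large values and accepts zeros.
log_values = np.log1p(toy["Y2013"])

# The 5 rows with the largest Y2013 values.
top5 = toy.nlargest(5, "Y2013")

# Melt: from one column per year (wide) to one row per (Area, Year) pair (long).
long_df = toy.melt(id_vars="Area", var_name="Year", value_name="Value")
```

The long format produced by `melt` is convenient for seaborn calls such as `sns.boxplot(data=long_df, x="Year", y="Value")` or `sns.lineplot(data=long_df, x="Year", y="Value", hue="Area")`.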

## Export

In [None]:
# Export CSV
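
A minimal export sketch, assuming a toy DataFrame and a hypothetical file name (`fao_clean.csv`) written to a temporary folder so the example leaves nothing behind in the working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented toy data standing in for the cleaned DataFrame.
toy = pd.DataFrame({"Area": ["France", "Brazil"], "Y2013": [1500, 9600]})

# Hypothetical file name; a temporary folder keeps the example side-effect free.
out_path = Path(tempfile.mkdtemp()) / "fao_clean.csv"

# index=False avoids writing the row index as an extra unnamed column.
toy.to_csv(out_path, index=False)

# Read the file back to check the round trip.
reloaded = pd.read_csv(out_path)
```

In the notebook itself, something like `df.to_csv("fao_clean.csv", index=False)` would write the file next to the notebook.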