# 0101 - First Session With Python - Solution Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

## About 

### Using Jupyter

You have 2 options: 
- Locally: 

    - **Install Anaconda https://www.anaconda.com/ or Jupyter https://jupyter.org/install on your machine**

    - Use Anaconda or Jupyter installed on the Unilasalle PC (**Warning ⚠️**: some packages may be missing) 


- Online:

    - **Use Google Colab https://colab.research.google.com/** (you have to be logged in to your Google account)

    - **Open this notebook on Google Colab URL**
        * Badge

    - Use Jupyter online  https://jupyter.org/try-jupyter (**Warning ⚠️**: External packages cannot be installed) 


### Material

All the material for this course can be found here:
- https://github.com/AlexandreGazagnes/Unilassalle-Public-Ressources/tree/main/4a-data-analysis

### Python / Jupyter ? 

A few questions: 
- Why Python ?
- Python vs R ? 
- What is Data Analysis ? 
- What are we talking about ? 
- What is Jupyter ?

### Context

You are a new employee of the NPO named "NPO".

You are in charge of data analysis.

Your first project is about GHG (greenhouse gas) emissions, more precisely those related to bovine meat.

### Data

After a quick look on the internet, you find a very interesting dataset on the FAO website. It contains a list of various indicators. You decide to use this dataset to identify segments of countries.

- Find relevant data : 
    - https://www.kaggle.com/datasets/unitednations/global-food-agriculture-statistics
    - https://www.kaggle.com/datasets/dorbicycle/world-foodfeed-production
    - https://www.fao.org/faostat/en/
    - https://fr-en.openfoodfacts.org/
    - https://fr-en.openfoodfacts.org/data


**You can use a preprocessed version of the dataset [here](https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv).** (Best option)



### Mission


Our job is to : 
* Prepare notebook environment
* Load data
* Explore data
* Clean data ==> Select relevant data
* Clean data ==> Handle missing values
* Clean data ==> Handle duplicates ? 
* Clean data ==> Handle outliers ?
* Perform some basic analysis and data inspection
* Perform some basic visualisation
* Export our data

### Useful Resources about Google Colab


- On Youtube : 
    - https://www.youtube.com/watch?v=8KeJZBZGtYo
    - https://www.youtube.com/watch?v=JJYZ3OE_lGo
    - https://www.youtube.com/watch?v=tCVXoTV12dE

### Useful Resources about Anaconda and Jupyter


- On Youtube : 
    - https://www.youtube.com/watch?v=ovlID7gefzE
    - https://www.youtube.com/watch?v=IMrxB8Mq5KU
    - https://www.youtube.com/watch?v=Ou-7G9VQugg
    - https://www.youtube.com/watch?v=5pf0_bpNbkw


### Teacher 

- More info : 
    - https://www.linkedin.com/in/alexandregazagnes/
    - https://github.com/AlexandreGazagnes
    

## Preliminaries

### System

These commands display the current working directory (`pwd`), list its content (`ls`) and move to the parent directory (`cd ..`):

Uncomment these lines if needed. 

In [None]:
# pwd

In [None]:
# cd ..

In [None]:
# ls

In [None]:
# cd ..

In [None]:
# ls

These commands will install the required packages:

**Please note that if you are using Google Colab, everything you need is already installed**

In [None]:
# !pip install pandas matplotlib seaborn plotly scikit-learn

This command will download the dataset:

**Please note that we will download the dataset later, in this notebook**

In [None]:
# !wget https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv

### Imports

Import data libraries:

In [None]:
import pandas as pd  # DataFrame
import numpy as np  # Matrix and advanced maths operations

Import Graphical libraries:

In [None]:
import matplotlib.pyplot as plt  # Visualisation
import seaborn as sns  # Visualisation
import plotly.express as px  # Visualisation (not used here)

:warning: **These imports are required: this notebook cannot run without pandas, matplotlib, etc.**

### Data

1st option : Download the dataset from the web

In [None]:
# url
url = "https://gist.githubusercontent.com/AlexandreGazagnes/2000e5c0e9149ffdb8c682a751ac448a/raw/35ad83320c26155415b7cccff8a4150ee80ee501/FAO_Unilassalle_raw.csv"
url

Read the data : 

In [None]:
df = pd.read_csv(url, encoding="latin1")
df.head()

2nd Option : Read data from a file

In [None]:
# # or

# fn = "my/awesome/repository/my_awesome_file.csv"
# fn = "./data/source/FAO.csv"
# df = pd.read_csv(fn, encoding='latin1')

## Data Exploration

### Display

Display the first rows of the dataset:

In [None]:
# head

df.head()

Display the last rows of the dataset:

In [None]:
# tail

df.tail(10)

Display a sample of the dataset:

In [None]:
# sample 10

df.sample(10)

In [None]:
# Sample with just 10% of the dataset

df.sample(frac=0.1)

### Structure

What is the shape of the dataset?

In [None]:
# shape
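
As a minimal sketch of this step, the cell below reads the shape of a small stand-in DataFrame. The `toy` name and its values are made up for illustration; with the real data you would call `df.shape` instead.

```python
import pandas as pd

# Toy data (made-up values) standing in for the FAO dataset
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Italy"],
        "Item": ["Bovine Meat", "Wheat and products", "Bovine Meat"],
        "Y2013": [1500.0, 320.0, 980.0],
    }
)

# shape is an attribute, not a method: no parentheses
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```

The result is a tuple `(number of rows, number of columns)`, so it can be unpacked into two variables as shown.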

What data types are present in the dataset?

In [None]:
# dtypes

:warning: 
**Please note that these are the main Python data types used here**

Data types : 
- int : *Integer* : 1, 2, 12332, 1_000_000
- float : *Float* : 1.243453, 198776.8789, 1.9776
- object : In this example, object stands for *String* : "Paris", "Rouen", "Lea" 

Count the number of columns with specific data types:

In [None]:
# value_counts
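
One possible way to answer both questions, sketched on a made-up `toy` DataFrame (names and values are illustrative, not taken from the FAO file): `dtypes` gives one data type per column, and `value_counts()` on that result counts how many columns share each type. The exact label shown for text columns may depend on your pandas version.

```python
import pandas as pd

# Toy data (made-up values) standing in for the FAO dataset
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Italy"],
        "Area Code": [1, 2, 3],
        "Y2012": [1400.0, 300.0, 900.0],
        "Y2013": [1500.0, 320.0, 980.0],
    }
)

# One dtype per column
print(toy.dtypes)

# Number of columns for each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```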

Select only string columns:

In [None]:
# select_dtypes

Counting unique values for string columns : 

In [None]:
# nunique
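
The two previous steps can be sketched as follows on made-up data. The notebook hint is `select_dtypes`; this sketch uses `exclude="number"` so that it does not depend on how your pandas version labels text columns (`include="object"` is the form often seen with older versions). The `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Toy data (made-up values) standing in for the FAO dataset
toy = pd.DataFrame(
    {
        "Area": ["France", "France", "Spain", "Italy"],
        "Item": ["Bovine Meat", "Wheat and products", "Bovine Meat", "Bovine Meat"],
        "Y2013": [1500.0, 7000.0, 320.0, 980.0],
    }
)

# Keep only the non-numeric (text) columns
text_cols = toy.select_dtypes(exclude="number")

# Number of distinct values in each text column
unique_counts = text_cols.nunique()
print(unique_counts)
```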

### Select data

Display all the columns : 

In [None]:
# columns

Just use a small number of columns : 

In [None]:
columns = [
    "Area Abbreviation",
    "Area Code",
    "Area",
    "Item Code",
    "Item",
    "Element Code",
    "Element",
    "Unit",
    "latitude",
    "longitude",
    "Y2010",
    "Y2011",
    "Y2012",
    "Y2013",
]
columns

Make your column selection and display the output : 

In [None]:
# loc ? => JUST THE OUTPUT

If this transformation is OK, you can overwrite your ```df``` variable : 

In [None]:
# loc ? => REWRITE the DF
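
A minimal sketch of the two steps above, on a made-up `toy` DataFrame with a shortened `columns` list (both are illustrative): first look at the result of the selection, then assign it back once you are happy with it.

```python
import pandas as pd

# Toy data (made-up values); "Y1990" plays the role of a column we do not want
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Italy"],
        "Item": ["Bovine Meat", "Bovine Meat", "Bovine Meat"],
        "Y1990": [1200.0, 250.0, 870.0],
        "Y2013": [1500.0, 320.0, 980.0],
    }
)

columns = ["Area", "Item", "Y2013"]

# Step 1: just the output, toy itself is unchanged
preview = toy.loc[:, columns]
print(preview)

# Step 2: overwrite the variable with the selection
toy = toy.loc[:, columns]
```

In `.loc[rows, columns]`, the `:` means "all rows".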

Use ```iloc``` to select the nth line and the mth column : 

In [None]:
# iloc

Use ```iloc``` to select data from the 1st to the nth line and from the 1st to the mth column : 

In [None]:
# iloc
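
Both uses of `iloc` can be sketched on a made-up `toy` DataFrame (values are illustrative). Keep in mind that positions start at 0 and that the end of a slice is excluded.

```python
import pandas as pd

# Toy data (made-up values) standing in for the FAO dataset
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Italy", "Peru"],
        "Item": ["Bovine Meat", "Bovine Meat", "Bovine Meat", "Bovine Meat"],
        "Y2013": [1500.0, 320.0, 980.0, 45.0],
    }
)

# 3rd line (position 2) and 1st column (position 0)
single_value = toy.iloc[2, 0]
print(single_value)

# First 3 lines and first 2 columns
block = toy.iloc[:3, :2]
print(block)
```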

Just keep in mind the global shape of our dataset : 

And the names of our columns :

Columns with the *code* keyword are not relevant : 

In [None]:
columns = ["Area Code", "Item Code", "Element Code"]
columns

Suppose we have 1_000 columns ...

Let's find a more *pythonic* way to extract the *code* columns : 
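
One way to do this, sketched on an empty `toy` DataFrame that only carries column names (a shortened, illustrative version of the real header): loop over the column names and keep those containing the word "code".

```python
import pandas as pd

# Empty toy DataFrame: only the column names matter here
toy = pd.DataFrame(
    columns=[
        "Area Abbreviation",
        "Area Code",
        "Area",
        "Item Code",
        "Item",
        "Element Code",
        "Element",
        "Unit",
        "Y2013",
    ]
)

columns = []  # start with an empty list
for col in toy.columns:  # visit each column name
    if "code" in col.lower():  # case-insensitive test
        columns.append(col)  # keep the matching names

print(columns)
```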

:clap: We have used : 
- a ```list``` : ```columns = [] ``` 
- a ```for``` loop
- an ```if``` statement 

What is the value of the ```columns``` variable ?

Let's drop these columns : 

In [None]:
# drop columns

Rewrite our dataframe 

In [None]:
# drop indexes

In [None]:
# Drop with errors="ignore"
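
A minimal sketch of dropping columns, on made-up data. Here `"Element Code"` is deliberately absent from the `toy` DataFrame to show what `errors="ignore"` changes: without it, pandas raises a `KeyError` for a missing label.

```python
import pandas as pd

# Toy data (made-up values); note that "Element Code" does not exist here
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain"],
        "Area Code": [1, 2],
        "Item": ["Bovine Meat", "Bovine Meat"],
        "Item Code": [10, 10],
    }
)

columns = ["Area Code", "Item Code", "Element Code"]

# Missing labels are silently skipped thanks to errors="ignore"
dropped = toy.drop(columns=columns, errors="ignore")
print(dropped.columns.tolist())
```

This is handy when a cell is run twice: the second run does not fail because the columns are already gone.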

Another usage of iloc : 

In [None]:
# Implementing iloc

So far so good, we can rewrite our ```df```

In [None]:
# Saving our df

Selecting a specific column : 

In [None]:
# 1st implementation

In [None]:
# 2nd implementation
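
Two common ways to select a single column, sketched on made-up data. Which two implementations the course has in mind is an assumption; the bracket and `.loc` forms below are simply two that return the same result.

```python
import pandas as pd

# Toy data (made-up values) standing in for the FAO dataset
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Italy"],
        "Item": ["Bovine Meat", "Wheat and products", "Poultry Meat"],
    }
)

# 1st implementation: square brackets
item_bracket = toy["Item"]

# 2nd implementation: .loc with all rows and one column
item_loc = toy.loc[:, "Item"]

# Both return the same pandas Series
print(item_bracket.equals(item_loc))
```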

Can we have a good representation of each unique value for the ```Item``` column ?

In [None]:
# Item unique ?

Is ```meat``` in our Item column ?

In [None]:
# Meat in Item unique ?

Use a list, a for loop and an if statement to be sure to have all items with ```Meat``` : 

In [None]:
# Select meat items
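
The last two questions can be sketched as follows. The item names in `toy` are illustrative. Testing `"Meat" in ...` against the list of values only finds an exact match, which is why a loop with a substring test is used to collect every item that mentions meat.

```python
import pandas as pd

# Toy data (illustrative item names)
toy = pd.DataFrame(
    {
        "Item": [
            "Bovine Meat",
            "Wheat and products",
            "Meat, Other",
            "Poultry Meat",
            "Bovine Meat",
        ]
    }
)

# Exact match only: no item is called exactly "Meat"
exact_match = "Meat" in toy["Item"].tolist()
print(exact_match)

# Substring test on each unique value
meat_items = []
for item in toy["Item"].unique():
    if "meat" in item.lower():
        meat_items.append(item)

print(meat_items)
```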

Build a boolean selector : 

In [None]:
# Creating a selector True / False

Select relevant data with the ```loc``` method : 

In [None]:
# .loc

Try a more advanced selection : 

In [None]:
# More advanced selection
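
The three steps above can be sketched on made-up data: build a True / False selector, pass it to `.loc`, then combine two conditions with `&`. The threshold of 1000 is arbitrary and only there for illustration.

```python
import pandas as pd

# Toy data (made-up values) standing in for the FAO dataset
toy = pd.DataFrame(
    {
        "Area": ["France", "France", "Spain", "Italy"],
        "Item": ["Bovine Meat", "Wheat and products", "Bovine Meat", "Poultry Meat"],
        "Y2013": [1500.0, 7000.0, 320.0, 980.0],
    }
)

# Boolean selector: one True / False per line
selector = toy["Item"] == "Bovine Meat"

# Keep only the lines where the selector is True
bovine = toy.loc[selector, :]

# More advanced: two conditions, each in parentheses, joined with &
big_bovine = toy.loc[selector & (toy["Y2013"] > 1000), :]
print(big_bovine)
```

Use `&` for "and" and `|` for "or"; the Python keywords `and` / `or` do not work element by element on a Series.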

What about Area ?

In [None]:
# Area?

And the number of unique values in Area ? 

In [None]:
# Area nunique ?

Same for Item : 

In [None]:
# Item nunique ?

Same for Unit : 

In [None]:
# Unit unique ?

Drop useless columns : 

In [None]:
# Drop other useless columns

columns = [
    "Item",
    "Element",
    "Unit",
    "latitude",
    "longitude",
]

### NaN Values

Let's have a look at NaN (Not a Number) values, i.e. missing values : 

In [None]:
# Nan Values

Compute the sum of missing values for each line : 

In [None]:
# Sum of Nan Values
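
A minimal sketch of counting missing values, on a made-up `toy` DataFrame where the NaN positions are invented for illustration. `isna()` returns True wherever a value is missing, and `sum()` counts the True values, per column by default or per line with `axis=1`.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values and made-up missing cells)
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Sudan"],
        "Y2012": [1400.0, np.nan, np.nan],
        "Y2013": [1500.0, 320.0, np.nan],
    }
)

# True where a value is missing
nan_mask = toy.isna()

# Number of missing values per column
nan_per_column = toy.isna().sum()

# Number of missing values per line
nan_per_row = toy.isna().sum(axis=1)

print(nan_per_column)
print(nan_per_row)
```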

Try to focus on a specific column: 

In [None]:
# Select Nan Values

Try to focus on a specific Country :

In [None]:
# Other selection
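
Both selections can be sketched with a boolean selector and `.loc`, here on made-up data (the missing cells are invented for illustration).

```python
import numpy as np
import pandas as pd

# Toy data (made-up values and made-up missing cells)
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Sudan"],
        "Y2012": [1400.0, np.nan, np.nan],
        "Y2013": [1500.0, 320.0, np.nan],
    }
)

# Lines where a specific column is missing
missing_2012 = toy.loc[toy["Y2012"].isna(), :]
print(missing_2012)

# Lines for a specific country
sudan = toy.loc[toy["Area"] == "Sudan", :]
print(sudan)
```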

Drop Sudan from our DataFrame : 

In [None]:
# Drop a specific row

In [None]:
# Drop a specific row
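
Two equivalent ways to remove a country, sketched on made-up data: drop the lines by their index labels, or keep every line that does not match. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values and made-up missing cells)
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Sudan"],
        "Y2013": [1500.0, 320.0, np.nan],
    }
)

# Option 1: find the index labels of the lines to remove, then drop them
rows_to_drop = toy.index[toy["Area"] == "Sudan"]
toy_clean = toy.drop(index=rows_to_drop)

# Option 2: keep everything except Sudan
toy_filtered = toy.loc[toy["Area"] != "Sudan", :]

print(toy_clean)
```

Neither option modifies `toy` itself; assign the result back to the variable to make the change permanent.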

Are we done ?


Useless but fun : 

Final output of ```df``` :


### Data Inspection

In [None]:
# Describe

In [None]:
# Better describe ?
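
As a sketch of what these two cells could contain (an assumption, since the cells are left empty): `describe()` summarises the numeric columns, and transposing plus rounding often gives a table that is easier to read. The `toy` values are made up.

```python
import pandas as pd

# Toy data (made-up values) standing in for the FAO dataset
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Italy", "Peru"],
        "Y2012": [1400.0, 300.0, 900.0, 40.0],
        "Y2013": [1500.0, 320.0, 980.0, 45.0],
    }
)

# Count, mean, std, min, quartiles and max of the numeric columns
summary = toy.describe()

# One line per column, rounded to 2 decimals
better_summary = toy.describe().T.round(2)
print(better_summary)
```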

In [None]:
# Recast as int

In [None]:
# Sort by values

In [None]:
# Select small values

In [None]:
# Select small values and sort

In [None]:
# select 'big' values ==> drop lower values

In [None]:
# sort by values top :

In [None]:
# Are we good ?

In [None]:
# Just to be sure :

In [None]:
# Creating tmp variable, just with numeric values

In [None]:
# A correlation matrix makes no sense here
# (sorry for that 😅)

In [None]:
# Heatmap ?

In [None]:
# Better heatmap ?

In [None]:
# Best heatmap ever done ?

In [None]:
# Build your first function

In [None]:
# Use this function

## Visualisation

### Distplot

In [None]:
# Just to be sure

In [None]:
# Just to be sure

In [None]:
# Distplot

In [None]:
# Distplot normal

In [None]:
# What about skewness ?

In [None]:
# What about kurtosis ?

In [None]:
# Log1p => log(x+1) ?
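
A small sketch of why `log1p` is useful, on a made-up Series spanning several orders of magnitude. `np.log1p(x)` computes log(x + 1), so it stays defined when a value is 0. Skewness measures the asymmetry of a distribution and kurtosis the weight of its tails.

```python
import numpy as np
import pandas as pd

# Made-up values with a long right tail
values = pd.Series([1.0, 10.0, 100.0, 1000.0, 10000.0], name="Y2013")

# Asymmetry and tail weight of the raw values
raw_skew = values.skew()
raw_kurt = values.kurt()

# Same measure after the log(x + 1) transformation
log_values = np.log1p(values)
log_skew = log_values.skew()

print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

On this toy Series the transformed values are much closer to symmetric than the raw ones.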

In [None]:
# Top 5

### Barplot

In [None]:
# Bar plot
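
A possible bar plot of the five largest values, sketched with plain matplotlib rather than seaborn to keep the example short. Area names and values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (made-up areas and values)
toy = pd.DataFrame(
    {
        "Area": ["A", "B", "C", "D", "E", "F", "G"],
        "Y2013": [120.0, 950.0, 430.0, 60.0, 780.0, 310.0, 15.0],
    }
)

# Keep the 5 lines with the largest Y2013 values
top5 = toy.nlargest(5, "Y2013")

# One bar per area
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(top5["Area"], top5["Y2013"])
ax.set_xlabel("Area")
ax.set_ylabel("Y2013")
ax.set_title("Top 5 areas in 2013 (toy data)")

# In Jupyter the figure is displayed at the end of the cell;
# in a plain script, add plt.show()
```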

In [None]:
# Same but better

### Boxplot

In [None]:
# My favorite plot EVER ;)

In [None]:
# Ok, this one

In [None]:
# Just another df output

### Lineplot

In [None]:
# Melt ?
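
Line plots usually need the data in "long" format: one line per (area, year) pair instead of one column per year. A minimal sketch with `melt`, on made-up data; the `Year` and `Value` column names are hypothetical choices.

```python
import pandas as pd

# Toy data in wide format (made-up values): one column per year
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain"],
        "Y2012": [1400.0, 300.0],
        "Y2013": [1500.0, 320.0],
    }
)

# Wide to long: the year columns become values of a single "Year" column
long_df = toy.melt(id_vars="Area", var_name="Year", value_name="Value")
print(long_df)
```

Each area now appears once per year, which is the layout expected by most plotting functions for a line plot.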

In [None]:
# Boxplot

In [None]:
# Line plot

In [None]:
# Melt only top 5

## Export

Export the csv file : 
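
A minimal sketch of the export, on made-up data. The file name is hypothetical, and the file is written to a temporary folder here so that the example does not clutter your project; in practice you would pass your own path to `to_csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data (made-up values) standing in for the cleaned DataFrame
toy = pd.DataFrame(
    {
        "Area": ["France", "Spain", "Italy"],
        "Y2013": [1500.0, 320.0, 980.0],
    }
)

# Hypothetical output path inside a temporary folder
output_path = Path(tempfile.mkdtemp()) / "fao_clean.csv"

# index=False avoids writing the line numbers as an extra column
toy.to_csv(output_path, index=False)

# Read the file back to check the export
reloaded = pd.read_csv(output_path)
print(reloaded)
```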