# Understanding data

## Setup
### Imports
In order to work with data, we need to import some libraries.

In [2]:
import pandas as pd                     # for dataset manipulation (DataFrames)
import sklearn.datasets                 # the datasets we are going to use
import numpy as np                      # allows some mathematical operations
import matplotlib.pyplot as plt         # library used to display graphs
import seaborn as sns                   # more convenient visualization library for dataframes

### Loading the dataset

In [None]:
iris = sklearn.datasets.load_iris()
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])

The dataset is now loaded into the `df` variable, which stands for "DataFrame".
DataFrames are objects provided by the `pandas` library. They are essentially convenient tables with many built-in functions to manipulate them.

You can see what the DataFrame looks like by executing the cell below:

In [None]:
df
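
As an aside, scikit-learn can also build the DataFrame itself. The sketch below assumes scikit-learn 0.23 or later, where `load_iris` accepts `as_frame=True`; the names `iris_bunch` and `df_alt` are only illustrative. In this variant the `target` column holds integers rather than floats.

```python
import sklearn.datasets

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
iris_bunch = sklearn.datasets.load_iris(as_frame=True)

# .frame holds the four feature columns plus a "target" column
df_alt = iris_bunch.frame
print(df_alt.shape)
print(df_alt.dtypes)
```

The rest of this notebook keeps using the `df` built with `np.c_`, so the two approaches should not be mixed without checking the dtype of `target`.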

## Data understanding
### Data source and documentation
Before writing any code, it is important to check where the data comes from and to gather as much information as possible about what it contains.
The iris dataset that we loaded above is a popular dataset for teaching machine learning, so information about it is easy to find online.


#### Questions
**Before beginning the data analysis, find the answers to the following questions:**
- Who created the dataset? When and why?
- Describe briefly what the iris dataset contains.
- What information do the columns contain?
- In particular, what is the `target` column, and what do its values correspond to?

*Hint: We use the `scikit-learn` library to load the dataset.*


#### Answers

- Who created the dataset? When and why? => Sir R.A. Fisher, in 1936, to illustrate discriminant analysis. (https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset)
- Describe briefly what the iris dataset contains. => The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.
- What information do the columns contain? => Sepal length, sepal width, petal length, and petal width (all in cm), plus the target.
- In particular, what is the `target` column, and what do its values correspond to? => The target column is the class of the iris plant. It takes 3 values: 0 (Iris-Setosa), 1 (Iris-Versicolour), and 2 (Iris-Virginica).
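
Most of these answers can also be read from the object returned by `load_iris`. This is a minimal sketch; the 600-character cut-off and the `target_labels` name are arbitrary choices made for illustration. Note that the label array spells the second class `versicolor`.

```python
import sklearn.datasets

iris = sklearn.datasets.load_iris()

# DESCR is the free-text description bundled with the dataset
print(iris.DESCR[:600])

# Column names, then the class label behind each target value
print(iris.feature_names)
target_labels = {index: str(name) for index, name in enumerate(iris.target_names)}
print(target_labels)
```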



### Getting general information about the dataset

#### Questions

1. How much data does the dataset contain?
2. How many features (columns) are there?
3. Name the different columns and their data types.
4. For each column, check the values of the following statistics: mean, standard deviation, minimum, maximum, and median.
5. How do these values vary within each type of iris? *(Use the code sample below as reference)*

*Hint: You will need to use the pandas functions: `DataFrame.shape`, `DataFrame.head()`, `DataFrame.describe()`, and `DataFrame.info()`. Make sure to [check their documentation](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)!*

In [None]:
# Below is a code sample to show you how to filter a DataFrame
filtered_data = df[df["sepal length (cm)"] > 5]
filtered_data

In [None]:
# Your code here
# Questions 1 and 2
df.shape #  1) 150 rows, 5 columns (df.info() also shows this)
# Question 3
df.info() # lists each column with its dtype
# Question 4
df.describe()
# Question 5
filtered2 = df[df["target"] == 2]
filtered2.describe()
filtered1 = df[df["target"] == 1]
filtered1.describe()
filtered0 = df[df["target"] == 0]
filtered0.describe()

*[Your answers here]*

**Question 1:** 150

**Question 2:** 5 columns (4 features plus the target)

**Question 3:** sepal length (cm) float64, sepal width (cm) float64, petal length (cm) float64, petal width (cm) float64, target float64

**Question 4:** 
| Column | Mean | Standard deviation | Minimum | Maximum | Median |
| --- | --- | --- | --- | --- | --- |
| sepal length (cm) | 5.843333 | 0.828066 | 4.3 | 7.9 | 5.8 |
| sepal width (cm) | 3.057333 | 0.435866 | 2.0 | 4.4 | 3.0 |
| petal length (cm) | 3.758000 | 1.765298 | 1.0 | 6.9 | 4.35 |
| petal width (cm) | 1.199333 | 0.762238 | 0.1 | 2.5 | 1.3 |
| target | 1.000000 | 0.819232 | 0.0 | 2.0 | 1.0 |

**Question 5:** 

Target 0:

| Column | Mean | Standard deviation | Minimum | Maximum | Median |
| --- | --- | --- | --- | --- | --- |
| sepal length (cm) | 5.006 | 0.352490 | 4.3 | 5.8 | 5.0 |
| sepal width (cm) | 3.428 | 0.379064 | 2.3 | 4.4 | 3.4 |
| petal length (cm) | 1.462 | 0.173664 | 1.0 | 1.9 | 1.5 |
| petal width (cm) | 0.246 | 0.105386 | 0.1 | 0.6 | 0.2 |
| target | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 |

Target 1:

| Column | Mean | Standard deviation | Minimum | Maximum | Median |
| --- | --- | --- | --- | --- | --- |
| sepal length (cm) | 5.936 | 0.516171 | 4.9 | 7.0 | 5.9 |
| sepal width (cm) | 2.770 | 0.313798 | 2.0 | 3.4 | 2.8 |
| petal length (cm) | 4.260 | 0.469911 | 3.0 | 5.1 | 4.35 |
| petal width (cm) | 1.326 | 0.197753 | 1.0 | 1.8 | 1.3 |
| target | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 |

Target 2:

| Column | Mean | Standard deviation | Minimum | Maximum | Median |
| --- | --- | --- | --- | --- | --- |
| sepal length (cm) | 6.588 | 0.635880 | 4.9 | 7.9 | 6.5 |
| sepal width (cm) | 2.974 | 0.322497 | 2.2 | 3.8 | 3.0 |
| petal length (cm) | 5.552 | 0.551895 | 4.5 | 6.9 | 5.55 |
| petal width (cm) | 2.026 | 0.274650 | 1.4 | 2.5 | 2.0 |
| target | 2.000000 | 0.000000 | 2.0 | 2.0 | 2.0 |
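
The per-class tables above were obtained by filtering three times. As an alternative, `groupby` can compute the same statistics in one pass. The sketch below rebuilds the DataFrame so that it runs on its own; `df_stats` and `per_class` are illustrative names, and the original column names are assumed.

```python
import numpy as np
import pandas as pd
import sklearn.datasets

iris = sklearn.datasets.load_iris()
df_stats = pd.DataFrame(
    data=np.c_[iris["data"], iris["target"]],
    columns=iris["feature_names"] + ["target"],
)

# One row per class, one (column, statistic) pair per output column
per_class = df_stats.groupby("target").agg(["mean", "std", "min", "max", "median"])
print(per_class["petal length (cm)"])
```

Selecting a single feature, as in the last line, gives a small table with one row per class that is easy to compare by eye.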

### Basic validity checks

To use a dataset for machine learning, we generally want "clean" data: no missing or absurd values, no unintended duplicates, and classes that are not heavily imbalanced.

#### Questions
1. How many rows contain missing data?
2. What does it mean for a dataset to be "balanced"? Do you think this dataset is balanced?
3. Is there any duplicated data in the dataset? In your opinion, is it good or bad for machine learning? Why?

*Hint: You will need to use the `value_counts()` and `duplicated()` functions.*

In [None]:
# Your code here
df.info()
df["target"].value_counts()
df.duplicated().value_counts()

*[Your answers here]*

**Question 1:** 0

**Question 2:** A dataset is balanced when every target class has the same number of samples; otherwise it is imbalanced. Here, the dataset is balanced (50 rows per class).

**Question 3:** Yes, 1. Whether this matters depends on the type of data; here, one flower has the same measurements as another, which is not a problem.
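
The three checks can be sketched a little more explicitly as follows. The snippet rebuilds the DataFrame so that it runs on its own; `df_check` and the other variable names are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd
import sklearn.datasets

iris = sklearn.datasets.load_iris()
df_check = pd.DataFrame(
    data=np.c_[iris["data"], iris["target"]],
    columns=iris["feature_names"] + ["target"],
)

# Rows with at least one missing value
rows_with_missing = int(df_check.isna().any(axis=1).sum())

# Number of rows per class; equal counts mean the classes are balanced
class_counts = df_check["target"].value_counts()

# keep=False marks every copy, so identical rows can be inspected side by side
duplicate_rows = df_check[df_check.duplicated(keep=False)]

print(rows_with_missing)
print(class_counts)
print(duplicate_rows)
```

Looking at the duplicated rows themselves, rather than only counting them, helps decide whether they are genuine repeated measurements or a data-entry problem.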

### Making the data more convenient to use

You have probably noticed that the column names are quite long and also contain spaces, which is generally inconvenient in code. You can use the following code to change them:

In [None]:
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', "species"]
df

**Bonus question**: How else can you rename columns in a dataframe?

In [None]:
# Your code here
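# rename() maps old names to new ones and returns a new DataFrame; keys that are absent are ignored by default
df = df.rename(columns={"sepal length (cm)": "sepal_length", "sepal width (cm)": "sepal_width", "petal length (cm)": "petal_length", "petal width (cm)": "petal_width", "target": "species"})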
df

The `species` column is also a bit hard to read, because the classes are represented by numbers. This can be preferable for some algorithms, but for our use case today we will replace these values with explicit names. For this, we will use the `apply()` function on that column.

In [None]:
# This is the function we will apply to the "species" column
def name_mapping(number:float):
    """This function maps 0.0, 1.0 and 2.0 to their corresponding values in the iris dataset."""

    name_map = {
        0.0: "setosa", # replace by the correct name
        1.0: "versicolour", # replace by the correct name
        2.0: "virginica", # replace by the correct name
    }

    if number not in name_map.keys(): # making sure the number is one of the expected values
        raise ValueError("Not a valid number!")

    return name_map[number] # This is an alternative to using a lot of if/else blocks

In [None]:
# We can now apply the function
df["species"] = df["species"].apply(name_mapping) # We pass the function as parameter, not its result! This is why we must not use parentheses.

In [None]:
# And check the result
df
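
For a plain lookup like this one, `Series.map` with a dictionary is a common alternative to `apply`. The sketch below uses a small stand-in Series with hypothetical values instead of the real column, so it can be run without modifying `df`.

```python
import pandas as pd

# Small stand-in for the numeric "species" column (hypothetical values)
codes = pd.Series([0.0, 1.0, 2.0, 1.0])

name_map = {0.0: "setosa", 1.0: "versicolour", 2.0: "virginica"}

# map() looks each value up in the dictionary; values missing from it become NaN
names = codes.map(name_map)
print(names.tolist())

# Unlike name_mapping(), map() does not raise on unknown values, so check explicitly
assert names.notna().all(), "Found a value with no matching name"
```

The trade-off is that `map` fails silently on unexpected values, whereas the `name_mapping` function raises a `ValueError`, which is why the explicit check is added at the end.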

## Data visualization

Data visualization will help us:
- Confirm and observe things we already know
- Learn new facts about the data

In this section **keep in mind that the goal of machine learning algorithms would be to classify the different species of iris**.

### Countplot
What does a `countplot` show? With what other function did you get similar results earlier?
Try experimenting with the parameters of the function.

In [None]:
sns.countplot(x='species', data=df)
plt.show()
# A countplot shows one bar per category, whose height is the number of rows in that category; we had similar results with df["target"].value_counts()
# It shows us that the data is balanced

### Boxplot
What does a `boxplot` show? With what other function did you get similar results earlier?
In your opinion, why would you need this type of graph?

In [None]:
# Utility function to simplify syntax later on
def boxplot(y):
    sns.boxplot(x="species", y=y, data=df)

# We define a figure where we will be adding our graphs
plt.figure(figsize=(15,10))

# And then add the plots to the grid on specific positions
plt.subplot(221)
boxplot('sepal_length')

plt.subplot(222)
boxplot('sepal_width')

plt.subplot(223)
boxplot('petal_length')

plt.subplot(224)
boxplot('petal_width')

plt.show()

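# A boxplot shows the median and the quartiles (the box spans the 25th to 75th percentiles); the whiskers extend to the most extreme values that are not outliers, and outliers are drawn as individual points. df.describe() gave similar numbers earlier. We need it to compare the data distribution between species.
# Sepal length, from largest to smallest: 1) virginica 2) versicolour 3) setosa
# Sepal width, from largest to smallest: 1) setosa 2) virginica 3) versicolour
# Petal length, from largest to smallest: 1) virginica 2) versicolour 3) setosa
# Petal width, from largest to smallest: 1) virginica 2) versicolour 3) setosa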
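
To make the anatomy of a box plot concrete, the quantities it draws can be computed by hand. This is a sketch on hypothetical measurements, assuming the usual convention (and seaborn's default, `whis=1.5`) that whiskers stop at the last data point within 1.5 times the interquartile range of the box.

```python
import numpy as np

# Hypothetical measurements, chosen so that one value lies far from the rest
values = np.array([4.9, 5.0, 5.1, 5.2, 5.4, 5.5, 5.8, 7.9])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Points outside these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker stops at the most extreme data point still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high, outliers)
```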

### Scatterplot
What does a `scatterplot` show?
What conclusions can you draw from this graph?
Try changing the inputs of the function. Does this change your observations? What new conclusions can you draw from this?

In [None]:
def scatter(xd, yd, hued, dfd):
    sns.scatterplot(x=xd, y=yd, hue=hued, data=dfd)
    plt.legend(bbox_to_anchor=(1, 1), loc=2) # Displays the legend outside the graph
    plt.show()

scatter('sepal_length', 'sepal_width', 'species', df)
scatter('petal_length', 'petal_width', 'species' , df)
scatter('species', 'sepal_width', 'sepal_length', df)

# A scatter plot shows one point per flower, positioned by two variables; coloring by species reveals how the classes group.
# Setosa sepal = widest / shortest
# Setosa petal = narrowest / shortest
# Versicolour sepal = narrowest on average / shorter than virginica
# Versicolour petal = between setosa and virginica, closer to virginica
# Virginica sepal = longest / intermediate width
# Virginica petal = widest / longest

### Displot
What does a `displot` show?
In the documentation, find what the "kind" parameter does, and try all the kinds of plot.
Can you imagine a use for the kde plots?

In [None]:
sns.displot(df, x="petal_length", kind="kde", hue="species")


### Pairplot
What does a `pairplot` show?
Can you draw any new conclusions from it?
In your opinion, what could be the uses of such a graph?

In [None]:
sns.pairplot(df, hue='species', height=2)

### Histogram
Histograms will not teach us anything new here, but they can be another way of visualizing data.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10,10))

axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['sepal_length'], bins=17)

axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['sepal_width'], bins=5)

axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['petal_length'], bins=6)

axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['petal_width'], bins=6);

### Correlation
Correlation measures how strongly two variables vary together. The Pearson coefficient used here captures linear relationships and ranges from -1 to 1. It can be computed directly with `pandas`.

In [None]:
# numeric_only=True (pandas 1.5+) skips the text column "species"
df.corr(method='pearson', numeric_only=True)

A `heatmap` makes it easier to see which correlations are the strongest.

In [None]:
sns.heatmap(df.corr(method='pearson', numeric_only=True), annot=True)
plt.show()

### Covariance
In the same way, try displaying the covariance matrix with a `heatmap`!

*Hint: You can access the covariance matrix with the `cov()` function.*

In [None]:
# Your code here
sns.heatmap(df.cov(numeric_only=True), annot=True)
plt.show()
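
Covariance and Pearson correlation are closely related: the correlation between two columns is their covariance divided by the product of their standard deviations. The sketch below checks this numerically on the four iris features. It reloads them with `as_frame=True` (which assumes scikit-learn 0.23 or later) so that it runs on its own; `features` and `corr_from_cov` are illustrative names.

```python
import numpy as np
import sklearn.datasets

# Features only, as a DataFrame with the original column names
features = sklearn.datasets.load_iris(as_frame=True).data

cov = features.cov()
std = features.std()

# Divide each covariance by the product of the two matching standard deviations
corr_from_cov = cov / np.outer(std, std)

print(np.allclose(corr_from_cov, features.corr(method="pearson")))
```

This is why the correlation heatmap is easier to compare across pairs of columns: unlike covariance, it does not depend on the scale of each variable.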

## Try it yourself!
As an exercise, try analysing another dataset that you do not know. There are many datasets freely available on the internet. For example, try loading [another one of sklearn's toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html)!

In [15]:
# Your code here
cancer = sklearn.datasets.load_breast_cancer()  # breast cancer dataset, not the digits dataset
daf = pd.DataFrame(data = cancer['data'], columns = cancer['feature_names'])
daf.shape
daf.info()
#sns.countplot(x='target', data=daf)
#sns.boxplot(x="target", y="pixel_0_0", data=daf)
#sns.scatterplot(x=xd, y=yd, hue=hued, data=dfd, )
#plt.legend(bbox_to_anchor=(1, 1), loc=2) # Displays the legend outside the graph
#plt.show()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5