# Paris' trees analysis
## Objective:

Define a better inspection route for the gardeners who take care of the city's trees.

By identifying tree types and characteristics, routes could be planned to minimize travel and make better use of the workforce.
## Characteristics:

What makes two trees alike or different? Size? Height? Health? Location?

A new venv was created and the libraries were installed from the shell:
$ pip install pandas numpy matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('p2-arbres-fr.csv', sep = ";")

In [None]:
data.head(5)

In [None]:
data.describe()

We can already observe that the continuous features "circonference_cm" and "hauteur_m" have abnormal max values. They will need cleaning, because such outliers distort the mean and the standard deviation.
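As a rough illustration of how such values could be flagged, here is a minimal sketch using the interquartile range (IQR) rule. The toy values, the helper name `iqr_upper_fence` and the factor `k = 1.5` are assumptions made for this example. They are not taken from the dataset or from the cleaning actually applied to it.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "circonference_cm": [60, 80, 95, 110, 120, 25000],
    "hauteur_m": [5, 8, 10, 12, 15, 900],
})

def iqr_upper_fence(series: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, a common threshold for flagging high outliers."""
    q1, q3 = series.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

for col in ["circonference_cm", "hauteur_m"]:
    fence = iqr_upper_fence(toy[col])
    flagged = toy[col] > fence
    # The mean is pulled far above the median by a single extreme value
    print(f"{col}: mean={toy[col].mean():.1f}, median={toy[col].median():.1f}, "
          f"fence={fence:.1f}, flagged={flagged.sum()}")
```

Comparing the mean with the median is a quick way to see the distortion: the median barely moves when one extreme value is present, while the mean does.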

In [None]:
data.info()

## First glance:
### Structure:

    18 columns
    200137 rows

Next, print the details of each column.

In [None]:
n_rows = len(data)  # 200137 for this dataset

for col in data:
    unique_counter = data[col].nunique()
    if unique_counter == n_rows:
        # Every value is distinct, so a frequency table would be useless
        print(col, "\n", "Only unique values", "\n")
    else:
        # Absolute and relative frequency of each value
        count = data[col].value_counts()
        freq = data[col].value_counts(normalize=True)
        df = pd.DataFrame({"Count": count, "Freq": freq})
        print(col, "\n", "Unique_counter: ", unique_counter, "\n", df, "\n")


## Inspection of each column:

1.   "id" => tree identifier => to be kept as is
2.   "type_emplacement" => 1 unique => no information on trees => to be *removed*
3.   "domanialite" => 90% of trees in 3 categories => to be **modified**
4.   "arrondissement" => various data => relevant to organize => to be kept as is
5. "complement_addresse" => can't be read by humans => to be *removed*
6. "numero" => 0 non-null => empty column => to be *removed*
7. "lieu" => incoherent => some have added information and others don't => to be **modified**
8. "id_emplacement" => can't be read by humans => to be *removed*
9. "libelle_francais" => not sure if important => to be kept as is
10. "genre" => scientific classification => to be kept as is
11. "espece" => scientific classification => to be kept as is
12. "variete" => scientific classification => to be kept as is
13. "circonference_cm" => abnormal data => to be **modified**
14. "hauteur_m" => abnormal data => to be **modified**
15. "stade_developpement" => hardly readable => to be **modified**
16. "remarquable" => 0 / 1 => needs to be translated into "yes" / "no" => to be **modified**
17. "geo_point_2d_a/geo_point_2d_b" => coordinates => to be kept as is
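The notes above do not say how a column like "domanialite" should be modified. One possible approach, shown here only as a sketch, is to keep the most frequent categories and group the rest under a single label. The helper name `collapse_rare`, the `top_n = 3` cutoff, the "Other" label and the toy values are all assumptions for illustration.

```python
import pandas as pd

def collapse_rare(series: pd.Series, top_n: int = 3, other: str = "Other") -> pd.Series:
    """Keep the top_n most frequent categories and relabel everything else."""
    top = series.value_counts().nlargest(top_n).index
    # Values outside the top categories (including NaN) become `other`
    return series.where(series.isin(top), other)

# Toy category column (made-up counts, not the real distribution)
toy = pd.Series(["Alignement"] * 5 + ["Jardin"] * 3 + ["CIMETIERE"] * 2 + ["DASCO", "DJS"])

collapsed = collapse_rare(toy)
print(collapsed.value_counts())
```

Note that this sketch also relabels missing values as "Other", which may or may not be desirable depending on how missing values are handled later.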

### Columns to be kept:
1. "id"
2. "arrondissement"
3. "libelle_francais"
4. "genre"
5. "espece"
6. "variete"
7. "geo_point_2d_a"
8. "geo_point_2d_b"

### Columns to be modified:
1. "domanialite"
2. "lieu"
3. "circonference_cm"
4. "hauteur_m"
5. "stade_developpement"
6. "remarquable"


### Irrelevant columns:
1. "type_emplacement"
2. "complement_addresse"
3. "numero"
4. "id_emplacement"

## Cleaning

### 1.   Create new dataframe without irrelevant columns

In [None]:
# Drop the columns judged irrelevant above
clearer_data = data.drop(columns=["type_emplacement",
                                  "complement_addresse",
                                  "numero",
                                  "id_emplacement"])
clearer_data.head()

### 2.   Clean and modify columns that need rearrangement

In [None]:
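# Replace 0 and 1 with "No" and "Yes" in "remarquable"
# (assign the result back: replace(..., inplace=True) on a selected column
# is not guaranteed to update clearer_data in recent pandas versions)
clearer_data["remarquable"] = clearer_data["remarquable"].replace({0: "No", 1: "Yes"})

# Replace 'J', 'JA', 'A' and 'M' with
# 'Jeune', 'Jeune Adulte', 'Adulte' and 'Mature' in "stade_developpement"
clearer_data["stade_developpement"] = clearer_data["stade_developpement"].replace(
    {"J": "Jeune",
     "JA": "Jeune Adulte",
     "A": "Adulte",
     "M": "Mature"})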

# Prepare a list of columns whose NaNs can easily be filled:
# those with some missing values, but fewer than 10%
to_modify = []
for col in clearer_data:
  miss_percent = clearer_data[col].isna().mean() * 100
  if 0 < miss_percent < 10:
    to_modify.append(col)

# Fill the missing values of these columns with the placeholder "Other"
for col in to_modify:
  clearer_data[col] = clearer_data[col].fillna("Other")

## Missing values

In [None]:
# Percentage of missing values in each column
for col in clearer_data:
  miss_total = clearer_data[col].isna().sum()
  if miss_total > 0:
    miss_percent = round(clearer_data[col].isna().mean() * 100, 5)
    print(col, "\n", miss_percent, "% Missing values", "\n",
          "Total : ", miss_total, "\n")

### Inference possibilities

* "stade_developpement" and "remarquable" have 34% and 32% missing values, respectively.
* "variete" has 82% missing values.

We can't infer these values with numerical statistics such as the mean or the median, as these are categorical features.

1.   "variete": try to infer it from "genre" and "espece"? The same genus and species may imply the same variety.

=> after research, knowing the variety may be irrelevant to tree maintenance

2.   "stade_developpement": try to infer it from "espece", "hauteur_m" and "circonference_cm"? A given height and circumference may imply a growth stage for a given tree species.

=> in any case,

3.   "remarquable": can't be inferred at all
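If the inference on "variete" were attempted, one possible approach is to fill each missing value with the most frequent value among trees sharing the same "genre" and "espece". The sketch below runs on made-up rows. The column names "variete_mode" and "variete_filled" are hypothetical, and the underlying assumption (same genus and species imply the same variety) is exactly the one questioned above, so this is an illustration rather than a recommended step.

```python
import pandas as pd

# Toy rows (made up for illustration)
toy = pd.DataFrame({
    "genre":  ["Platanus", "Platanus", "Platanus", "Tilia", "Tilia", "Acer"],
    "espece": ["x hispanica", "x hispanica", "x hispanica",
               "tomentosa", "tomentosa", "negundo"],
    "variete": ["Pyramidalis", "Pyramidalis", None, "Brabant", None, None],
})

# Most frequent known variety per (genre, espece) group.
# Groups with no known variety are absent from this lookup.
# On ties, mode() returns sorted values, so the first one alphabetically is kept.
modes = (
    toy.dropna(subset=["variete"])
       .groupby(["genre", "espece"])["variete"]
       .agg(lambda s: s.mode().iloc[0])
       .rename("variete_mode")
)

# Attach the group mode to every row, then use it only where "variete" is missing
joined = toy.join(modes, on=["genre", "espece"])
toy["variete_filled"] = joined["variete"].fillna(joined["variete_mode"])

print(toy)
```

Rows whose group has no known variety at all (the "Acer" row here) stay missing. The same pattern could be tried for "stade_developpement" by grouping on "espece" together with binned height and circumference, under the same caveats.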