# Paris' trees analysis
## Objective:

Define a better inspection route for gardeners to take care of the city's trees

Once tree types and characteristics are identified, routes can be planned to minimize travel and make the best use of the workforce.
## Characteristics:

What makes two trees alike or different? Size? Height? Health? Location?

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
data = pd.read_csv('p2-arbres-fr.csv', sep = ";")

In [4]:
data.head(5)

Unnamed: 0,id,type_emplacement,domanialite,arrondissement,complement_addresse,numero,lieu,id_emplacement,libelle_francais,genre,espece,variete,circonference_cm,hauteur_m,stade_developpement,remarquable,geo_point_2d_a,geo_point_2d_b
0,99874,Arbre,Jardin,PARIS 7E ARRDT,,,MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E,19,Marronnier,Aesculus,hippocastanum,,20,5,,0.0,48.85762,2.320962
1,99875,Arbre,Jardin,PARIS 7E ARRDT,,,MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E,20,If,Taxus,baccata,,65,8,A,,48.857656,2.321031
2,99876,Arbre,Jardin,PARIS 7E ARRDT,,,MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E,21,If,Taxus,baccata,,90,10,A,,48.857705,2.321061
3,99877,Arbre,Jardin,PARIS 7E ARRDT,,,MAIRIE DU 7E 116 RUE DE GRENELLE PARIS 7E,22,Erable,Acer,negundo,,60,8,A,,48.857722,2.321006
4,99878,Arbre,Jardin,PARIS 17E ARRDT,,,PARC CLICHY-BATIGNOLLES-MARTIN LUTHER KING,000G0037,Arbre à miel,Tetradium,daniellii,,38,0,,,48.890435,2.315289


In [None]:
data.describe()

In [None]:
data.info()

## First glance :
### Structure:

    18 columns
    200137 rows
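
These figures can be read from the dataframe instead of being copied by hand. A minimal sketch on a small stand-in dataframe (`sample` is hypothetical; in this notebook the same attribute would be read from `data`):

```python
import pandas as pd

# Hypothetical stand-in for `data`, reusing three of the real column names
sample = pd.DataFrame({
    "id": [99874, 99875, 99876],
    "genre": ["Aesculus", "Taxus", "Taxus"],
    "hauteur_m": [5, 8, 10],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"{n_cols} columns")
print(f"{n_rows} rows")
```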

Next, print the value counts and frequencies of each column.

In [None]:
# Number of rows: a column with this many unique values has no duplicates
max_value = len(data)

for col in data:
    unique_counter = data[col].nunique()
    if unique_counter == max_value:
        print(col, "Only unique values")
    else:
        # Absolute and relative frequency of each value in the column
        count = data[col].value_counts()
        freq = data[col].value_counts(normalize=True)

        df = pd.DataFrame({"Count": count, "Freq": freq})
        print(col, "Unique_counter: ", unique_counter, "\n", df)


## Inspection of each column:

    "id" => tree identifier => to be kept as is
    "type_emplacement" => 1 unique => no information on trees => to be removed
    "domanialite" => 90% of trees in 3 categories => to be modified
    "arrondissement" => various data => relevant to organize => to be kept as is
    "complement_addresse" => can't be read by humans => to be removed
    "numero" => 0 non-null => empty column => to be removed
    "lieu" => inconsistent => some entries include extra information and others don't => to be modified
    "id_emplacement" => can't be read by humans => to be removed
    "libelle_francais" => not sure if important => to be kept as is
    "genre" => scientific classification => to be kept as is
    "espece" => scientific classification => to be kept as is
    "variete" => scientific classification => to be kept as is
    "circonference_cm" => abnormal data => to be modified
    "hauteur_m" => abnormal data => to be modified
    "stade_developpement" => hardly readable => to be modified
    "remarquable" => 0 / 1 => needs to be translated into "yes" / "no" => to be modified
    "geo_point_2d_a/geo_point_2d_b" => coordinates => to be kept as is
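
A claim such as "90% of trees in 3 categories" for "domanialite" can be checked with a cumulative share of the most frequent values. The sketch below is only illustrative: the series and its category values are hypothetical, and the real column would be `data["domanialite"]`.

```python
import pandas as pd

# Hypothetical sample; the category names and counts are made up for illustration
domanialite = pd.Series(
    ["Alignement"] * 5 + ["Jardin"] * 3 + ["Cimetiere"] * 1 + ["Autre"] * 1
)

# Relative frequency of each category, most frequent first
freq = domanialite.value_counts(normalize=True)

# Share of trees covered by the 3 most frequent categories
top3_share = freq.head(3).sum()
print(freq)
print(f"Top 3 categories cover {top3_share:.0%} of trees")
```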

### Columns to be kept:

    "id"
    "arrondissement"
    "libelle_francais"
    "genre"
    "espece"
    "variete"
    "geo_point_2d_a"
    "geo_point_2d_b"

### Columns to be modified:

    "domanialite"
    "lieu"
    "circonference_cm"
    "hauteur_m"
    "stade_developpement"
    "remarquable"
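
For "circonference_cm" and "hauteur_m", one possible way to handle abnormal data is to replace implausible values with missing values. This is a sketch under assumptions, not the method used in this notebook: the sample rows, the bounds `MAX_HEIGHT_M` and `MAX_CIRCUMFERENCE_CM`, and the helper `mask_abnormal` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; zeros and very large values are the kind of entries to flag
trees = pd.DataFrame({
    "hauteur_m": [5, 8, 0, 250, 10],
    "circonference_cm": [20, 65, 0, 90, 60000],
})

# Assumed plausibility bounds, not taken from the dataset
MAX_HEIGHT_M = 60
MAX_CIRCUMFERENCE_CM = 1000

def mask_abnormal(values, upper):
    """Return values with non-positive or too-large entries set to NaN."""
    return values.where((values > 0) & (values <= upper), np.nan)

trees["hauteur_m"] = mask_abnormal(trees["hauteur_m"], MAX_HEIGHT_M)
trees["circonference_cm"] = mask_abnormal(trees["circonference_cm"], MAX_CIRCUMFERENCE_CM)
print(trees)
```

Setting the values to NaN rather than dropping the rows keeps the tree in the dataset, so its location can still be used for route planning.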

### Irrelevant columns:

    "type_emplacement"
    "complement_addresse"
    "numero"
    "id_emplacement"



## Cleaning
1. Create new dataframe without irrelevant columns

In [None]:
clearer_data = data.drop(columns=["type_emplacement",
                                  "complement_addresse",
                                  "numero",
                                  "id_emplacement"])
clearer_data.head()

2. Replacing "remarquable" and "stade_developpement" values with human readable content

In [None]:
# Replace 0 and 1 with "No" and "Yes" in "remarquable"
# (assign the result back: in-place replace on a selected column is unreliable)
clearer_data["remarquable"] = clearer_data["remarquable"].replace({0: "No", 1: "Yes"})

# Replace 'J', 'JA', 'A' and 'M' with
# 'Jeune', 'Jeune Adulte', 'Adulte' and 'Mature' in "stade_developpement"
clearer_data["stade_developpement"] = clearer_data["stade_developpement"].replace(
    {"J": "Jeune",
     "JA": "Jeune Adulte",
     "A": "Adulte",
     "M": "Mature"})

## Missing values

1. Exploring missing values and their extent

In [None]:
for col in clearer_data:
    # Count missing values directly; indexing value_counts() by position
    # breaks when more than half of the column is missing
    miss_count = clearer_data[col].isna().sum()
    if miss_count > 0:
        miss_percent = round(clearer_data[col].isna().mean() * 100, 5)
        print(col, "\n", miss_percent, "% Missing values", "\n",
              "Total : ", miss_count, "\n")
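
The same information can also be gathered into a single table, which is easier to sort and compare. A minimal sketch on a hypothetical `sample` dataframe (the values are made up; in the notebook the dataframe would be `clearer_data`):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some missing values
sample = pd.DataFrame({
    "genre": ["Aesculus", "Taxus", "Taxus", "Acer"],
    "variete": [np.nan, np.nan, "x", np.nan],
    "remarquable": [0.0, np.nan, np.nan, 0.0],
})

# isna().sum() counts missing values per column; isna().mean() gives their share
missing = pd.DataFrame({
    "Total": sample.isna().sum(),
    "Percent": (sample.isna().mean() * 100).round(2),
})

# Keep only columns that have missing values, worst first
missing = missing[missing["Total"] > 0].sort_values("Percent", ascending=False)
print(missing)
```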



1. Unique

    "domanialite" has one missing value.

It can be ignored or filled with information. It should not be very time consuming either way.

2. Very low

    "libelle_francais", "genre", "espece" each have less than 1% missing values.

They can be ignored: filling them would require gathering a large amount of data, and they have little relevance to tree maintenance.

3. Medium and High

    "variete", "stade_developpement" and "remarquable" have 18%, 34%, and 32% missing values respectively.

We can't infer these values statistically, as they are qualitative features.


## Inference possibilities

    "variete": try to infer from "genre" and "espece"? The same genus and species may imply the same variety

=> after research, knowing the variety may be irrelevant to tree maintenance
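
Whether "genre" and "espece" really determine "variete" can be tested before attempting any inference. The sketch below counts distinct known varieties per (genre, espece) pair; the sample rows and the variety names "A", "B", "C" are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; variety names are placeholders
sample = pd.DataFrame({
    "genre":   ["Acer", "Acer", "Acer", "Taxus", "Taxus"],
    "espece":  ["negundo", "negundo", "negundo", "baccata", "baccata"],
    "variete": ["A", "A", np.nan, "B", "C"],
})

# Number of distinct non-missing varieties per (genre, espece) pair
# (nunique ignores NaN by default)
varieties_per_species = sample.groupby(["genre", "espece"])["variete"].nunique()

# Share of pairs with at most one known variety
single_share = (varieties_per_species <= 1).mean()
print(varieties_per_species)
print(f"{single_share:.0%} of (genre, espece) pairs have at most one variety")
```

If most pairs had several varieties, filling the missing values from "genre" and "espece" alone would not be reliable.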

    "stade_developpement": try to infer from "espece", "hauteur_m" and "circonference_cm"? A given height and circumference may imply a growth stage for a given tree species

=> in any case,

    "remarquable": can't be inferred at all
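
The "stade_developpement" idea above could be explored roughly as follows. This is only a sketch of one possible approach (nearest median height per species), not something the notebook implements: the sample rows, the helper `nearest_stage` and the column `stade_estime` are hypothetical, and only height is used for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one species, two rows with an unknown stage
sample = pd.DataFrame({
    "espece": ["baccata"] * 6,
    "hauteur_m": [2, 3, 8, 10, 9, 2],
    "stade_developpement": ["Jeune", "Jeune", "Adulte", "Adulte", np.nan, np.nan],
})

# Median height per (species, stage), computed on rows with a known stage
known = sample.dropna(subset=["stade_developpement"])
medians = known.groupby(["espece", "stade_developpement"])["hauteur_m"].median()

def nearest_stage(row):
    """Return the stage whose median height is closest, or NaN for unseen species."""
    if row["espece"] not in medians.index.get_level_values("espece"):
        return np.nan
    species_medians = medians.loc[row["espece"]]
    return (species_medians - row["hauteur_m"]).abs().idxmin()

# Keep the original column and write estimates into a separate one
missing = sample["stade_developpement"].isna()
sample["stade_estime"] = sample["stade_developpement"]
sample.loc[missing, "stade_estime"] = sample[missing].apply(nearest_stage, axis=1)
print(sample)
```

Writing the estimates to a separate column keeps inferred stages distinguishable from recorded ones.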


