# Paris Trees Dataset Analysis Exercises

This notebook contains exercises for practicing pandas on the Paris Trees dataset. Each exercise lists its tasks, followed by a worked solution and its output.

First, let's import the required libraries and load our data:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('./../data/les_arbres_upload_1k.csv')
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (999, 17)


| | idbase | location_type | domain | arrondissement | suppl_address | number | address | id_location | name | genre | species | variety | circumference | height | stage | remarkable | geo_point_2d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 249403 | Arbre | Alignement | PARIS 20E ARRDT | 54 | | AVENUE GAMBETTA | 1402008 | Tilleul | Tilia | tomentosa | | 85 | 10 | Adulte | NON | 48.86685102642415, 2.400262189227641 |
| 1 | 2020183 | Arbre | Jardin | PARIS 17E ARRDT | Sortie des Charmes | | PARC CLICHY BATIGNOLLES MARTIN LUTHER KING / 1... | E00301002 | Charme | Carpinus | betulus | | 33 | 6 | Jeune (arbre) | NON | 48.89227416971153, 2.3138805655035446 |
| 2 | 290961 | Arbre | Alignement | PARIS 12E ARRDT | 35 | | AVENUE COURTELINE | 503003 | Robinier | Robinia | pseudoacacia | Bessoniana | 25 | 5 | Jeune (arbre) | NON | 48.84403805860435, 2.4147874668129603 |
| 3 | 2038819 | Arbre | CIMETIERE | SEINE-SAINT-DENIS | | | CIMETIERE DE PANTIN / DIV 124 | D00000124006 | Arbre à soie | Albizia | sp. | | 20 | 5 | | NON | 48.90334546262646, 2.4130271915901162 |
| 4 | 2042131 | Arbre | Jardin | PARIS 18E ARRDT | | | PARC CHAPELLE CHARBON - 5Z RUE DE LA CROIX MOREAU | 404003 | Amélanchier | Amelanchier | lamarckii | | 16 | 4 | Jeune (arbre) | NON | 48.896917788620975, 2.362766161184105 |


## Exercise 1: Data Exploration and Null Values

1. Check for null values in each column using .isna()
2. Calculate the percentage of null values for each column
3. Identify columns with the highest number of missing values

In [2]:
# Check for null values
print("Number of null values per column:")
print(df.isna().sum())

# Calculate percentage of null values
print("\nPercentage of null values per column:")
print((df.isna().sum() / len(df) * 100).round(2))

# Identify columns with highest number of nulls
null_counts = df.isna().sum().sort_values(ascending=False)
print("\nTop 5 columns with most null values:")
print(null_counts.head())

Number of null values per column:
idbase              0
location_type       0
domain              0
arrondissement      0
suppl_address     698
number            999
address             0
id_location         0
name                7
genre               0
species            19
variety           803
circumference       0
height              0
stage             243
remarkable        102
geo_point_2d        0
dtype: int64

Percentage of null values per column:
idbase             0.00
location_type      0.00
domain             0.00
arrondissement     0.00
suppl_address     69.87
number           100.00
address            0.00
id_location        0.00
name               0.70
genre              0.00
species            1.90
variety           80.38
circumference      0.00
height             0.00
stage             24.32
remarkable        10.21
geo_point_2d       0.00
dtype: float64

Top 5 columns with most null values:
number           999
variety          803
suppl_address    698
stage            243
remarkable       102
dtype: int64
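
As a side note on step 2, the percentage of missing values can also be obtained in one step, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) so it runs on its own:

```python
import pandas as pd

# Toy frame standing in for the trees data (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["Tilleul", None, "Charme", "Platane"],
    "variety": [None, None, "Bessoniana", None],
    "height": [10, 6, 5, 25],
})

# isna() yields booleans; their column-wise mean is the fraction of missing values
null_pct = (toy.isna().mean() * 100).round(2)

# Keep only the columns that actually have gaps, worst first
null_pct_sorted = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct_sorted)
```

This gives the same numbers as dividing `isna().sum()` by `len(df)`; filtering out the zero rows simply makes the problem columns easier to spot.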

## Exercise 2: Categorical Analysis

1. Use value_counts() to analyze the distribution of tree species
2. Calculate the percentage of trees in each arrondissement
3. Find the most common tree genres

In [3]:
# Distribution of tree species
print("Top 10 tree species:")
print(df['species'].value_counts().head(10))

# Trees per arrondissement
print("\nPercentage of trees per arrondissement:")
arrond_pct = (df['arrondissement'].value_counts() / len(df) * 100).round(2)
print(arrond_pct)

# Most common genres
print("\nTop 5 tree genres:")
print(df['genre'].value_counts().head())

Top 10 tree species:
species
x hispanica       167
hippocastanum      97
japonicum          59
tomentosa          41
sp.                39
x carnea           35
australis          30
nigra              25
pseudoplatanus     23
platanoides        22
Name: count, dtype: int64

Percentage of trees per arrondissement:
arrondissement
PARIS 13E ARRDT     8.41
PARIS 16E ARRDT     8.31
PARIS 15E ARRDT     7.81
PARIS 19E ARRDT     7.61
PARIS 12E ARRDT     7.51
PARIS 20E ARRDT     7.31
SEINE-SAINT-DENIS   6.01
BOIS DE VINCENNES   5.61
PARIS 17E ARRDT     5.41
PARIS 14E ARRDT     5.31
PARIS 18E ARRDT     5.01
PARIS 7E ARRDT      4.00
PARIS 8E ARRDT      3.50
PARIS 11E ARRDT     3.20
HAUTS-DE-SEINE      3.00
VAL-DE-MARNE        2.90
PARIS 10E ARRDT     2.00
BOIS DE BOULOGNE    1.80
PARIS 5E ARRDT      1.70
PARIS 4E ARRDT      1.20
PARIS 6E ARRDT      0.70
PARIS 9E ARRDT      0.60
PARIS 3E ARRDT      0.60
PARIS 1ER ARRDT     0.30
PARIS 2E ARRDT      0.20
Name: count, dtype: float64

Top 5 tree genres:
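
For step 2, `value_counts` can compute shares directly via `normalize=True`. One detail worth knowing: by default it ignores missing values, so the denominator is the number of non-null rows, whereas dividing by `len(df)` counts every row. For a column without nulls (such as `arrondissement` here) both give the same result; for a column with gaps they differ. A minimal sketch on made-up data (`toy` is illustrative only):

```python
import pandas as pd

# Toy column with one missing entry (illustrative values only)
toy = pd.Series(
    ["PARIS 13E ARRDT", "PARIS 13E ARRDT", "PARIS 16E ARRDT", None],
    name="arrondissement",
)

# normalize=True returns shares instead of counts; NaN is excluded by default
share = (toy.value_counts(normalize=True) * 100).round(2)

# dropna=False keeps missing values as their own category,
# so the shares are relative to all rows
share_with_nan = (toy.value_counts(normalize=True, dropna=False) * 100).round(2)

print(share)
print(share_with_nan)
```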

## Exercise 3: Creating New Columns

1. Create a 'size_category' column based on height
2. Extract latitude and longitude from geo_point_2d
3. Create a 'tree_age_group' column based on stage and circumference

In [4]:
# Create size category
def get_size_category(height):
    if height <= 5:
        return 'Small'
    elif height <= 10:
        return 'Medium'
    else:
        return 'Large'

df['size_category'] = df['height'].apply(get_size_category)

# Extract coordinates
df[['latitude', 'longitude']] = df['geo_point_2d'].str.split(',', expand=True).apply(lambda x: x.str.strip())
df[['latitude', 'longitude']] = df[['latitude', 'longitude']].astype(float)

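# Create age group
# Note: only 'Jeune (arbre)' is read from stage; every other stage value
# (including missing ones) is classified by circumference alone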
def get_age_group(row):
    if row['stage'] == 'Jeune (arbre)':
        return 'Young'
    elif row['circumference'] > 100:
        return 'Old'
    else:
        return 'Mature'

df['tree_age_group'] = df.apply(get_age_group, axis=1)

# Display results
print("Size categories distribution:")
print(df['size_category'].value_counts())

print("\nAge groups distribution:")
print(df['tree_age_group'].value_counts())

Size categories distribution:
size_category
Small     358
Medium    323
Large     318
Name: count, dtype: int64

Age groups distribution:
tree_age_group
Mature    528
Old       303
Young     168
Name: count, dtype: int64
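
The row-wise `apply` calls above work, but the same rules can be expressed in vectorized form. The sketch below is one possible alternative, not the notebook's method: `pd.cut` reproduces the height thresholds (its bins are right-inclusive by default) and `np.select` reproduces the `if`/`elif` order, since the first matching condition wins. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "height": [0, 5, 6, 10, 11, 25],
    "stage": ["Jeune (arbre)", "Adulte", None, "Jeune (arbre)", "Mature", "Adulte"],
    "circumference": [33, 150, 120, 130, 80, 100],
})

# Bins are right-inclusive: (-inf, 5] -> Small, (5, 10] -> Medium, (10, inf] -> Large
toy["size_category"] = pd.cut(
    toy["height"],
    bins=[-float("inf"), 5, 10, float("inf")],
    labels=["Small", "Medium", "Large"],
)

# Conditions are checked in order, like if/elif; unmatched rows get the default
conditions = [
    toy["stage"] == "Jeune (arbre)",
    toy["circumference"] > 100,
]
toy["tree_age_group"] = np.select(conditions, ["Young", "Old"], default="Mature")

print(toy[["height", "size_category", "stage", "circumference", "tree_age_group"]])
```

One behavioral difference: `pd.cut` leaves a missing height as missing, while the `get_size_category` function would label it 'Large' because comparisons with NaN are false. The `height` column has no nulls in this sample, so the results agree here.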


## Exercise 4: Data Filtering and Analysis

1. Find all remarkable trees
2. Calculate average height and circumference by genre
3. Identify trees taller than the average height

In [5]:
# Find remarkable trees
remarkable_trees = df[df['remarkable'] == 'OUI']
print("Number of remarkable trees:", len(remarkable_trees))
print("\nRemarkable trees details:")
print(remarkable_trees[['name', 'genre', 'height', 'circumference', 'arrondissement']])

# Average dimensions by genre
genre_stats = df.groupby('genre').agg({
    'height': 'mean',
    'circumference': 'mean'
}).round(2)
print("\nAverage dimensions by genre (top 5):")
print(genre_stats.sort_values('height', ascending=False).head())

# Trees taller than average
avg_height = df['height'].mean()
tall_trees = df[df['height'] > avg_height]
print(f"\nNumber of trees taller than average ({avg_height:.2f}m):", len(tall_trees))

Number of remarkable trees: 1

Remarkable trees details:
        name     genre  height  circumference  arrondissement
434  Platane  Platanus      25            710  PARIS 7E ARRDT

Average dimensions by genre (top 5):
                   height  circumference
genre                                   
Sequoia             20.00         185.00
Platanus            13.34         122.20
Gymnocladus         11.50          64.00
Aesculus            11.39         113.12
x Cupressocyparis   10.50         114.00

Number of trees taller than average (8.69m): 453
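
A per-genre mean says little when a genre has only a handful of trees, and a ranking by mean height can be dominated by such genres. One way to check is to report the group size next to the mean using named aggregation. The sketch below uses made-up data, and the `min_trees` threshold is a hypothetical choice for illustration:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "genre": ["Sequoia", "Platanus", "Platanus", "Platanus", "Tilia", "Tilia"],
    "height": [20, 25, 10, 5, 10, 12],
})

# Named aggregation: each output column is (source column, function)
genre_stats = (
    toy.groupby("genre")
    .agg(mean_height=("height", "mean"), n_trees=("height", "size"))
    .round(2)
)

# Hypothetical threshold: only rank genres with at least this many trees
min_trees = 2
reliable = genre_stats[genre_stats["n_trees"] >= min_trees].sort_values(
    "mean_height", ascending=False
)

print(genre_stats)
print(reliable)
```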


## Exercise 5: Advanced Analysis

1. Calculate correlation between height and circumference
2. Find the distribution of trees by domain and stage
3. Analyze the relationship between location_type and tree dimensions

In [6]:
# Correlation analysis
correlation = df['height'].corr(df['circumference'])
print(f"Correlation between height and circumference: {correlation:.2f}")

# Distribution by domain and stage
domain_stage_dist = pd.crosstab(df['domain'], df['stage'])
print("\nDistribution of trees by domain and stage:")
print(domain_stage_dist)

# Analysis by location_type
location_stats = df.groupby('location_type').agg({
    'height': ['mean', 'std', 'count'],
    'circumference': ['mean', 'std']
}).round(2)
print("\nTree dimensions by location type:")
print(location_stats)

Correlation between height and circumference: 0.82

Distribution of trees by domain and stage:
stage         Adulte  Jeune (arbre)  Jeune (arbre)Adulte  Mature
domain                                                          
Alignement       225            105                  119      17
CIMETIERE         38              8                   10       8
DASCO             19              4                    8       1
DFPE               0              0                    1       0
DJS               14              2                    7       2
Jardin            65             47                   36      11
PERIPHERIQUE       6              2                    1       0

Tree dimensions by location type:
              height            circumference      
                mean  std count          mean   std
location_type                                      
Arbre           8.69 5.93   999         81.23 65.12
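
Note that `pd.crosstab` silently drops rows where either variable is missing: the table above sums to 756 trees, which is 999 minus the 243 rows with a null `stage`. If those rows should stay visible, one option is to fill the gaps with an explicit label first. The sketch below uses made-up data; the label `"Unknown"` is an arbitrary choice, and `normalize="index"` additionally turns counts into per-domain shares.

```python
import pandas as pd

# Toy data with missing stages (illustrative values only)
toy = pd.DataFrame({
    "domain": ["Alignement", "Alignement", "Jardin", "Jardin", "Jardin"],
    "stage": ["Adulte", None, "Adulte", "Jeune (arbre)", None],
})

# Default behaviour: rows with a missing stage are not counted
counts_default = pd.crosstab(toy["domain"], toy["stage"])

# Give missing stages an explicit (arbitrary) label so they are counted
stage_filled = toy["stage"].fillna("Unknown")
counts = pd.crosstab(toy["domain"], stage_filled)

# Row-wise shares: each domain's row sums to 1
shares = pd.crosstab(toy["domain"], stage_filled, normalize="index").round(2)

print(counts)
print(shares)
```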


## Exercise 6: Challenge - Comprehensive Analysis

Create a comprehensive analysis that combines multiple aspects:
1. Identify the most diverse arrondissements in terms of tree species
2. Analyze the distribution of tree dimensions across different domains
3. Create a summary of tree characteristics by age group
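
A caveat for step 1: as the sample rows show, the `species` column holds only the specific epithet (for example `tomentosa` or `sp.`), with the genus stored separately in `genre`. The same epithet can occur in different genera (*Pinus nigra* and *Populus nigra*, for instance), and `sp.` marks an undetermined species, so counting unique epithets is only an approximation of species diversity. The sketch below shows one possible refinement on made-up data; the `binomial` column name and the decision to exclude `sp.` are assumptions for illustration.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "arrondissement": ["A", "A", "B", "B"],
    "genre": ["Pinus", "Populus", "Tilia", "Albizia"],
    "species": ["nigra", "nigra", "tomentosa", "sp."],
})

# Counting epithets alone merges Pinus nigra and Populus nigra
by_epithet = toy.groupby("arrondissement")["species"].nunique()

# Assumption: drop undetermined species, then count genus + epithet pairs
known = toy[toy["species"].notna() & (toy["species"] != "sp.")].copy()
known["binomial"] = known["genre"] + " " + known["species"]
by_binomial = known.groupby("arrondissement")["binomial"].nunique()

print(by_epithet)
print(by_binomial)
```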

In [7]:
# Species diversity by arrondissement
diversity_by_arrond = df.groupby('arrondissement')['species'].nunique().sort_values(ascending=False)
print("Number of unique species by arrondissement:")
print(diversity_by_arrond)

# Tree dimensions by domain
domain_dimensions = df.groupby('domain').agg({
    'height': ['mean', 'min', 'max'],
    'circumference': ['mean', 'min', 'max'],
    'species': 'nunique'
}).round(2)
print("\nTree dimensions by domain:")
print(domain_dimensions)

# Summary by age group
age_group_summary = df.groupby('tree_age_group').agg({
    'height': 'mean',
    'circumference': 'mean',
    'remarkable': lambda x: (x == 'OUI').sum(),
    'species': 'nunique'
}).round(2)
print("\nSummary by age group:")
print(age_group_summary)

Number of unique species by arrondissement:
arrondissement
PARIS 15E ARRDT      39
PARIS 13E ARRDT      39
PARIS 19E ARRDT      37
PARIS 20E ARRDT      34
PARIS 12E ARRDT      32
PARIS 16E ARRDT      30
PARIS 18E ARRDT      28
BOIS DE VINCENNES    28
SEINE-SAINT-DENIS    26
PARIS 17E ARRDT      25
PARIS 14E ARRDT      24
PARIS 11E ARRDT      17
HAUTS-DE-SEINE       15
VAL-DE-MARNE         13
PARIS 10E ARRDT      12
PARIS 5E ARRDT        9
PARIS 4E ARRDT        9
BOIS DE BOULOGNE      8
PARIS 8E ARRDT        8
PARIS 7E ARRDT        7
PARIS 3E ARRDT        5
PARIS 9E ARRDT        5
PARIS 6E ARRDT        4
PARIS 2E ARRDT        2
PARIS 1ER ARRDT       2
Name: species, dtype: int64

Tree dimensions by domain:
             height         circumference          species
               mean min max          mean min  max nunique
domain                                                    
Alignement    10.27   1  26         87.84   5  322      68
CIMETIERE      6.38   0  30         63.58   0  29