# Pandas Exercises with Tree Data

This notebook provides a series of exercises on a dataset of trees in Paris. The exercises cover basic data cleaning, exploration, and filtering, followed by derived columns, CSV export, and a simple plot.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('les_arbres_upload_1k.csv')

## Exercise 1: Basic Inspection

1.  Display the first 5 rows of the DataFrame using `.head()`.
2.  Display the last 5 rows of the DataFrame using `.tail()`.
3.  Get and print the shape of the DataFrame (number of rows and columns) using `.shape`.
4. Get and print the column names using `.columns`.


In [None]:
# Exercise 1 Solution Here


In [2]:
print(df.head())
print(df.tail())
print(df.shape)
print(df.columns)

   idbase location_type     domain   arrondissement  suppl_address  number  \
0  249403         Arbre Alignement  PARIS 20E ARRDT             54     NaN   
1  2020183        Arbre     Jardin  PARIS 17E ARRDT  Sortie des Charmes     NaN   
2  290961         Arbre Alignement  PARIS 12E ARRDT             35     NaN   
3  2038819        Arbre    CIMETIERE  SEINE-SAINT-DENIS            NaN     NaN   
4  2042131        Arbre     Jardin  PARIS 18E ARRDT            NaN     NaN   

                                            address  id_location  \
0                                    AVENUE GAMBETTA    1402008   
1  PARC CLICHY BATIGNOLLES MARTIN LUTHER KING / 1...  E00301002   
2                                  AVENUE COURTELINE     503003   
3             CIMETIERE DE PANTIN / DIV 124  D00000124006   
4   PARC CHAPELLE CHARBON - 5Z RUE DE LA CROIX M...     404003   

          name      genre    species     variety  circumference  height  \
0       Tilleul      Tilia  tomentosa         NaN 

## Exercise 2: Handling Missing Data

1. Check for missing values in each column using `.isna().sum()`, and print the result.
2. Fill the missing values in the `variety` column with 'Unknown' using `.fillna()` and then display the number of missing values using `.isna().sum()`. 

In [None]:
# Exercise 2 Solution Here

In [3]:
print(df.isna().sum())
df['variety'] = df['variety'].fillna('Unknown')
print(df.isna().sum())

idbase               0
location_type       0
domain              0
arrondissement      0
suppl_address     464
number            825
address             0
id_location         0
name                0
genre               0
species             0
variety           693
circumference       0
height              0
stage               0
remarkable          0
geo_point_2d        0
dtype: int64
idbase               0
location_type       0
domain              0
arrondissement      0
suppl_address     464
number            825
address             0
id_location         0
name                0
genre               0
species             0
variety              0
circumference       0
height              0
stage               0
remarkable          0
geo_point_2d        0
dtype: int64
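
The same pattern can be tried on a small frame where the expected counts are easy to verify by eye. The sketch below uses a hypothetical three-row DataFrame named `toy` (made-up values, not rows from `les_arbres_upload_1k.csv`) and assigns the result of `.fillna()` back to the column, as the solution above does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from les_arbres_upload_1k.csv
toy = pd.DataFrame({
    "species": ["tomentosa", "betulus", "cordata"],
    "variety": [np.nan, "Fastigiata", np.nan],
})

# Count missing values per column before and after filling
missing_before = toy.isna().sum()
toy["variety"] = toy["variety"].fillna("Unknown")
missing_after = toy.isna().sum()

print(missing_before)
print(missing_after)
```

Only `variety` is filled here; other columns with missing values would keep their `NaN` entries, just as `suppl_address` and `number` do in the output above.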


## Exercise 3: Value Counts

1.  Calculate and print the value counts for the `location_type` column.
2.  Calculate and print the value counts for the `arrondissement` column.
3.  Calculate and print the value counts for the `species` column. 

In [None]:
# Exercise 3 Solution Here

In [4]:
print(df['location_type'].value_counts())
print(df['arrondissement'].value_counts())
print(df['species'].value_counts())

Arbre    1000
Name: location_type, dtype: int64
PARIS 16E ARRDT        108
PARIS 12E ARRDT        100
PARIS 19E ARRDT        100
SEINE-SAINT-DENIS       98
PARIS 13E ARRDT        81
PARIS 17E ARRDT        75
BOIS DE VINCENNES       62
PARIS 20E ARRDT        57
PARIS 14E ARRDT        55
PARIS 7E ARRDT         45
PARIS 18E ARRDT        38
PARIS 5E ARRDT         33
BOIS DE BOULOGNE        19
PARIS 8E ARRDT         19
PARIS 10E ARRDT        17
PARIS 11E ARRDT        15
VAL-DE-MARNE          13
PARIS 4E ARRDT         11
PARIS 6E ARRDT          6
PARIS 9E ARRDT          4
HAUTS-DE-SEINE         4
PARIS 2E ARRDT          2
Name: arrondissement, dtype: int64
x hispanica          198
hippocastanum         126
tomentosa              87
betulus                49
cordata                37
sp.                    33
nigra                  32
calleryana             30
x carnea               23
ornus                  22
australis              20
platanoides            19
excelsior              17
serr
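
Raw counts can be hard to compare between columns, so it may help to look at proportions as well. The following is a minimal sketch on a hypothetical toy Series (made-up values), using the `normalize=True` option of `value_counts()`.

```python
import pandas as pd

# Hypothetical toy values, chosen so the proportions are easy to check
species = pd.Series(
    ["x hispanica", "x hispanica", "tomentosa", "betulus"],
    name="species",
)

counts = species.value_counts()                # absolute counts, most frequent first
shares = species.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```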

## Exercise 4: Data Type Conversion

1. Convert the `circumference` column to integer type using `.astype(int)`. 
2. Convert the `height` column to float type using `.astype(float)`. 

In [None]:
# Exercise 4 Solution Here

In [5]:
df['circumference'] = df['circumference'].astype(int)
df['height'] = df['height'].astype(float)
print(df[['circumference','height']].dtypes)

circumference    int64
height           float64
dtype: object
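
The cast to `int` succeeds here because `circumference` has no missing values (see the Exercise 2 output). On a column that does contain `NaN`, a plain `.astype(int)` raises an error. The sketch below illustrates this on a hypothetical toy Series and shows one possible workaround, pandas' nullable `"Int64"` dtype; whether that dtype suits a given analysis is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing measurement
circ = pd.Series([10.0, np.nan, 35.0], name="circumference")

try:
    circ.astype(int)
    cast_failed = False
except ValueError:
    # NaN has no integer representation, so the plain cast is rejected
    cast_failed = True

# Nullable integer dtype (note the capital "I") keeps the gap as <NA>
circ_nullable = circ.astype("Int64")

print(cast_failed)
print(circ_nullable)
```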


## Exercise 5: Filtering and Subsetting

1.  Create a new DataFrame `df_tall_trees` that contains only the trees with a `height` greater than 20.
2. Create a new DataFrame `df_platanes` that contains only the trees of `species` equal to 'x hispanica'.
3. Create a new DataFrame `df_old_trees` that contains only the trees of `stage` equal to 'Mature'.

In [None]:
# Exercise 5 Solution Here

In [6]:
df_tall_trees = df[df['height'] > 20]
print(df_tall_trees.shape)
df_platanes = df[df['species'] == 'x hispanica']
print(df_platanes.shape)
df_old_trees = df[df['stage'] == 'Mature']
print(df_old_trees.shape)

(119, 17)
(198, 17)
(49, 17)
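
Boolean masks like the ones above can also be combined. A minimal sketch, assuming a hypothetical four-row frame `toy` with made-up values: each comparison is wrapped in parentheses and joined with `&` (and) or `|` (or), since Python's `and`/`or` keywords do not work element-wise on Series.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Paris dataset
toy = pd.DataFrame({
    "species": ["x hispanica", "x hispanica", "tomentosa", "betulus"],
    "height": [25.0, 12.0, 22.0, 8.0],
})

# Each comparison needs its own parentheses; & means "and", | means "or"
tall_platanes = toy[(toy["height"] > 20) & (toy["species"] == "x hispanica")]
tall_or_platane = toy[(toy["height"] > 20) | (toy["species"] == "x hispanica")]

print(tall_platanes.shape)
print(tall_or_platane.shape)
```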


## Exercise 6: Creating a new Column with Lambda

1. Create a new column `diameter` computed from the `circumference` column, using a lambda function with `.apply()` (diameter = circumference / π).


In [None]:
# Exercise 6 Solution Here

In [None]:
import numpy as np

# diameter = circumference / pi, computed row by row with a lambda
df['diameter'] = df['circumference'].apply(lambda x: x / np.pi)
print(df[['idbase','circumference','diameter']].head())
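
The exercise asks for `apply()` with a lambda, but for simple arithmetic the same result can be obtained by dividing the whole column at once. The sketch below compares the two approaches on a hypothetical toy frame; the column names `diameter_apply` and `diameter_vectorized` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical circumferences
toy = pd.DataFrame({"circumference": [0, 100, 314]})

# Row-by-row version, as in the exercise
toy["diameter_apply"] = toy["circumference"].apply(lambda x: x / np.pi)

# Vectorized version: one arithmetic operation on the whole column
toy["diameter_vectorized"] = toy["circumference"] / np.pi

# Both columns should hold the same values
same = np.allclose(toy["diameter_apply"], toy["diameter_vectorized"])
print(toy)
print(same)
```

The vectorized form is usually shorter and faster on large columns, while `apply()` remains useful when the per-row logic cannot be expressed as column arithmetic.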

## Exercise 7: Exporting Data

1. Compute the value counts of the `species` column again and store the result in a new DataFrame `species_counts` using `.value_counts().reset_index()`.
2. Export `species_counts` to a CSV file named `species_counts.csv`, without the index, using `.to_csv('species_counts.csv', index=False)`.

In [None]:
# Exercise 7 Solution Here

In [None]:
species_counts = df['species'].value_counts().reset_index()
print(species_counts)
species_counts.to_csv('species_counts.csv', index=False)
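
The column names produced by `.value_counts().reset_index()` depend on the pandas version (the naming changed in pandas 2.0), so the CSV header can differ between environments. The sketch below shows one way to pin the names explicitly. It uses hypothetical toy values and writes to an in-memory `io.StringIO` buffer instead of a file, purely so the result can be inspected without touching the disk.

```python
import io

import pandas as pd

# Hypothetical toy values with distinct frequencies
species = pd.Series(["a", "a", "a", "b", "b", "c"], name="species")

# Name the index and the counts explicitly so the header is predictable
species_counts = (
    species.value_counts()
    .rename_axis("species")
    .reset_index(name="count")
)

# Write the CSV into a text buffer rather than a file
buffer = io.StringIO()
species_counts.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(species_counts)
print(csv_text)
```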

## Exercise 8: Plotting a Histogram

1.  Create a histogram of the `circumference` column with 100 bins using `.hist(bins=100)`.

In [None]:
# Exercise 8 Solution Here

In [None]:
df['circumference'].hist(bins=100)
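
`.hist()` draws through Matplotlib, so that library needs to be installed for the plot to appear. To inspect the binning itself without a figure, `np.histogram` can be used; it returns the count per bin and the bin edges, which should correspond to the bars of an equal-width histogram. A minimal sketch on hypothetical toy values, with 3 bins instead of 100 so the result is easy to check:

```python
import numpy as np
import pandas as pd

# Hypothetical circumferences
circ = pd.Series([10, 20, 20, 30, 90, 100], name="circumference")

# 3 equal-width bins between the minimum and maximum value
counts, edges = np.histogram(circ, bins=3)

print(counts)
print(edges)
```

With `bins=n`, there are always `n + 1` edges, and the counts add up to the number of non-missing values passed in.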