# Part I: Data management with Pandas
In this part, we use the Pandas data toolkit to manipulate, clean, and query data. Pandas was created
by Wes McKinney in 2008 and is an open-source project under a very permissive license.

## Exercise 1: Pandas Series data structure
If you are familiar with Pandas and Series manipulation, you can skip this one and go to the second exercise. To
start, create a new notebook using your local environment or a Colab environment. Name your
notebook PW2-[firstName Lastname MajorName]. Now answer the following questions:
1. Import Pandas and create a list of sports containing ['Football', 'HandBall', 'SnowSport'], and
store it in a variable sports;


In [142]:
import pandas as pd

# Create a list of sports
sports = ['Football', 'HandBall', 'SnowSport']

2. Cast your list as a Series and display it using the class pandas.Series (see the documentation for class pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=False))

In [143]:
sports_series = pd.Series(sports)
print(sports_series)

0     Football
1     HandBall
2    SnowSport
dtype: object


3. what is the type of each element of your list?

In [144]:
for item in sports:
    print(type(item))

<class 'str'>
<class 'str'>
<class 'str'>


4. Create a new list named numeric containing a short list of numbers and display it as a series.

In [145]:
numeric = [5, 10, 15]
numeric_series = pd.Series(numeric)
print(numeric_series)

0     5
1    10
2    15
dtype: int64


5. Add the value None to both sports and numeric and compare how the resulting Series are displayed.
What do you notice?


In [146]:
numeric_with_none = [None, 15, 100]
sports_with_none = ['BasketBall', 'HandBall', None]

numeric_series_with_none = pd.Series(numeric_with_none)
sports_series_with_none = pd.Series(sports_with_none)

print(numeric_series_with_none)
print(sports_series_with_none)

0      NaN
1     15.0
2    100.0
dtype: float64
0    BasketBall
1      HandBall
2          None
dtype: object


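Observation:

When None is added to the numeric list, it is converted to NaN and the Series is upcast from int64 to float64, because NaN is a floating-point value.
In the sports list, None is kept as the Python object None and the Series keeps the object dtype.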
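
The observation above can be checked with a minimal sketch. The lists below are made up for illustration, and the dtype printed for the string Series depends on the pandas version (it is object in the version used in this notebook).

```python
import pandas as pd

# Toy lists for illustration only (not part of the exercise data)
numbers = pd.Series([None, 15, 100])
labels = pd.Series(['BasketBall', 'HandBall', None])

# None in a numeric list becomes NaN, which forces a floating-point dtype
print(numbers.dtype)
# The dtype of the string Series depends on the pandas version
print(labels.dtype)

# isna() flags the missing entry in both Series, whether it is stored as None or NaN
numbers_missing = numbers.isna().tolist()
labels_missing = labels.isna().tolist()
print(numbers_missing, labels_missing)
```

Using isna() rather than comparing with None or NaN directly is the safer way to detect missing entries, since NaN is not equal to itself.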

6. Let us repair the missing value in sports by replacing None with NaN (numpy.nan).

In [147]:
import numpy as np

sports_with_nan = ['BasketBall', 'HandBall', np.nan]
sports_series_with_nan = pd.Series(sports_with_nan)
print(sports_series_with_nan)

0    BasketBall
1      HandBall
2           NaN
dtype: object


7. Let us construct our variable sports from a dictionary as follows: {'BasketBall', 'HandBall', 'Snowsport',
'baseBall', 'Swimming'}. Cast the variable to a Series and store it in a new variable sIndex. Display sIndex
and access its index attribute. What do you notice?


In [148]:
sports_dict = {'BasketBall': 'HandBall', 'Snowsport': 'baseBall', 'Swimming': None}
sIndex = pd.Series(sports_dict)

print(sIndex)
print(sIndex.index)

BasketBall    HandBall
Snowsport     baseBall
Swimming          None
dtype: object
Index(['BasketBall', 'Snowsport', 'Swimming'], dtype='object')
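
As a small sketch of how a dictionary maps onto a Series: the keys become the index labels. The mapping and the 'golf' key below are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical mapping used only for illustration
codes = {'bask': 'BasketBall', 'hand': 'HandBall', 'swim': 'Swimming'}
s = pd.Series(codes)

# The dictionary keys become the index labels, in insertion order
index_labels = list(s.index)
print(index_labels)

# Passing index= selects and reorders keys; a key absent from the dictionary yields a missing value
s2 = pd.Series(codes, index=['swim', 'bask', 'golf'])
print(s2)
```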


## Exercise 2: Series Querying
Let us consider the variable sports built from a dictionary.

In [149]:
import numpy as np

sports = {'bask': 'BasketBall', 'hand': 'HandBall', 'snow': 'Snowsport', 'base': 'baseBall', 'swim': 'Swimming'}
# Cast the dictionary to a Series
sIndex = pd.Series(sports)
# Display the Series
display(sIndex)

bask    BasketBall
hand      HandBall
snow     Snowsport
base      baseBall
swim      Swimming
dtype: object

1. Print the second element of the variable sports and find the element labeled 'swim' (use iloc[] for
position-based indexing and loc[] for label-based indexing).


In [150]:
# Print second element using iloc (position-based)
print(sIndex.iloc[1])  # Second element based on position

# Find the element labeled 'swim' using loc (label-based)
print(sIndex.loc['swim'])  # Element with label 'swim'

HandBall
Swimming
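
The difference between iloc[] and loc[] is easiest to see when the labels are integers that do not match the positions. The Series below is a hypothetical example, not part of the exercise data.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series(['BasketBall', 'HandBall', 'Swimming'], index=[10, 20, 30])

by_position = s.iloc[1]   # second element, positions start at 0
by_label = s.loc[20]      # element whose label is 20

# A positional slice excludes its end, a label slice includes its end label
slice_pos = s.iloc[0:2].tolist()
slice_lab = s.loc[10:20].tolist()

print(by_position, by_label, slice_pos, slice_lab)
```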


2. Transform all items of your Series sports to uppercase characters (see Series.str.upper())

In [151]:
sIndex_upper = sIndex.str.upper()
print(sIndex_upper)

bask    BASKETBALL
hand      HANDBALL
snow     SNOWSPORT
base      BASEBALL
swim      SWIMMING
dtype: object


3. Write a simple loop to transform sports to uppercase characters

In [152]:
sIndex_upper_loop = pd.Series([item.upper() if item else None for item in sIndex])
print(sIndex_upper_loop)

0    BASKETBALL
1      HANDBALL
2     SNOWSPORT
3      BASEBALL
4      SWIMMING
dtype: object


4. Compare the runtime of both solutions and explain the gap (see %timeit, an IPython
magic function that times a particular piece of code).


In [153]:
# Using Series.str.upper() with vectorization
%timeit sIndex.str.upper()

# Using a loop
%timeit pd.Series([item.upper() if item else None for item in sIndex])

92.3 µs ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
69 µs ± 7.25 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


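Explanation:

In this measurement the loop (69 µs) is actually slightly faster than Series.str.upper() (92.3 µs). On a five-element Series the fixed overhead of the .str accessor likely dominates, and on object-dtype strings the accessor still applies the Python string method element by element rather than running a compiled array operation.
The two solutions are also not equivalent: the loop builds a new Series with a default integer index, whereas Series.str.upper() keeps the original labels.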
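
To see how the gap behaves on larger inputs, a sketch like the following times both approaches with the standard timeit module. The 100,000-element Series and the helper names upper_accessor and upper_loop are assumptions made for this example; the measured times depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

# Hypothetical larger Series: the five sports repeated many times
big = pd.Series(['BasketBall', 'HandBall', 'Snowsport', 'baseBall', 'Swimming'] * 20_000)

def upper_accessor(s):
    return s.str.upper()

def upper_loop(s):
    # Keep the original index so both results are directly comparable
    return pd.Series([item.upper() for item in s], index=s.index)

a, b = upper_accessor(big), upper_loop(big)
same = (a.tolist() == b.tolist()) and a.index.equals(b.index)
print("same result:", same)

# Time 5 calls of each approach; the numbers vary from run to run
for fn in (upper_accessor, upper_loop):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(fn.__name__, round(seconds, 4))
```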

5. compare the runtime between the function np.mean() and a loop to calculate the mean value of your numeric
variable

In [154]:
numeric = [None, 15, 100]
num = pd.Series(numeric)
%timeit np.mean(num)

41.2 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [155]:
total = 0
count = 0
for item in num:
    if pd.notna(item):  # Skip NaN values
        total += item
        count += 1
mean_value = total / count if count != 0 else 0
print(mean_value)

# Note: %timeit needs a statement to time; to time this loop, put %%timeit on the first line of the cell

57.5


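Explanation of Runtime Difference:

Only np.mean() was timed here (41.2 µs); the loop cell produced no timing, so the comparison below is qualitative. Note that np.mean() applied to a Series delegates to Series.mean(), which skips NaN by default, just as the loop does.
Vectorization (using np.mean()): NumPy and Pandas reductions run in compiled code over the whole array, so they are usually much faster on large datasets. On a three-element Series, fixed call overhead dominates and the gap can be small or even reversed.
Loop: It iterates over the elements explicitly in Python, so its cost grows with every element.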
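
A minimal sketch of the missing measurement, assuming a hypothetical Series of 100,000 random values with some NaN entries. The helper name loop_mean is introduced for this example only, and the timings depend on the machine.

```python
import timeit
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random values, every tenth one set to NaN
rng = np.random.default_rng(0)
values = rng.random(100_000)
values[::10] = np.nan
num = pd.Series(values)

def loop_mean(s):
    # Explicit Python loop that skips missing values
    total, count = 0.0, 0
    for item in s:
        if pd.notna(item):
            total += item
            count += 1
    return total / count if count else float('nan')

vectorized = num.mean()   # skips NaN by default (skipna=True)
looped = loop_mean(num)
print("same mean:", np.isclose(vectorized, looped))

# Time 3 calls of each approach; the numbers vary from run to run
print("Series.mean:", timeit.timeit(lambda: num.mean(), number=3))
print("loop:", timeit.timeit(lambda: loop_mean(num), number=3))
```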

# Part II: Data verification
## Exercise 3: From CSV to DataFrame
As a first step, we need Python libraries that let us load our data as a dataframe. Download the
CSV file named Custemers.csv. This file contains 15 columns separated by a ',' character.

1. Load the file Custemers.csv and store its content in a dataframe variable Custemers_Data

In [156]:
import pandas as pd

# Load the file into a DataFrame (ensure the path to the file is correct)
Custemers_Data = pd.read_csv('data/Custemers.csv', delimiter=',')

2. Display the shape of your dataframe and display the header to understand the meaning of each column.

In [157]:
# Display the shape of the DataFrame (rows, columns)
print(Custemers_Data.shape)

# Display the first few rows of the DataFrame
print(Custemers_Data.head())

(10000, 15)
                                                name  \
0                                       Deonte Stark   
1                                     Faustino Boyer   
2  Eddy Bogisich,33431 Dollie Squares Apt. 654,Po...   
3                                     Mervyn Kreiger   
4  Katlyn Doyle,4650 Beer Crossing Suite 848,Nort...   

                        address             city           state         zip  \
0            278 Mueller Plains       North Euna         Alabama  03404-4384   
1  70244 Skiles Falls Suite 030  North Altohaven      California  01522-1310   
2                           NaN              NaN             NaN         NaN   
3            376 Dorinda Stream     Shaniquafort  South Carolina  39347-4438   
4                           NaN              NaN             NaN         NaN   

                 phone                          email           work  \
0   (180)940-9676x4495           shanna73@hotmail.com     Hahn-Mayer   
1  (308)699-6239x81011    

3. Use dataframe.columns to display the column labels that index your data

In [158]:
# Display column names
print(Custemers_Data.columns)

Index(['name', 'address', 'city', 'state', 'zip', 'phone', 'email', 'work',
       'work address', 'work city', 'work state', 'work zipcode', 'work phone',
       'work email', 'account created on'],
      dtype='object')


## Exercise 4: Catch missing values
Now we want to evaluate the amount of missing data in the whole dataset. Display the total number of missing values in each
column of Custemers_Data.
1. You can use dataframe.isnull() and vectorization to sum the missing values in each column

In [159]:
# Check for missing values in each column
missing_values = Custemers_Data.isnull().sum()
print(missing_values)

name                    81
address               3365
city                  3374
state                 3365
zip                   3362
phone                 3366
email                 3363
work                  3390
work address          3383
work city             3378
work state            3390
work zipcode          3378
work phone            3377
work email            3359
account created on    3368
dtype: int64


2. Display the rows with missing values in the 'name' column; you should get 81 rows

In [160]:
# Display rows where 'name' column has missing values
missing_names = Custemers_Data[Custemers_Data['name'].isnull()]
print(missing_names)

# Check the number of rows with missing names
print(f"Number of rows with missing names: {missing_names.shape[0]}")

     name                          address              city           state  \
148   NaN     558 Brycen Mission Suite 152        Cristmouth        Arkansas   
273   NaN       0481 Sanford Lake Apt. 439     Bashirianberg  North Carolina   
300   NaN    38689 Kimora Groves Suite 807    New Nadiahaven         Vermont   
330   NaN       077 Walsh Summit Suite 123        Rogahnfurt         Indiana   
467   NaN              87140 Loma Crescent   North Dixieport        Michigan   
...   ...                              ...               ...             ...   
9316  NaN               054 Aubrie Corners    East Genevieve   New Hampshire   
9487  NaN       9957 Rempel Wells Apt. 081         New Micky         Alabama   
9745  NaN  6277 Schneider Common Suite 939         Port Aura    North Dakota   
9816  NaN               986 Brianne Shoals  Port Wilbertstad     Mississippi   
9834  NaN             19233 Kreiger Meadow        Kleinville        Michigan   

             zip               phone   

3. Try removing rows with missing names. We call the dropna() function, which has several arguments to control
how the data is filtered:
• axis=0 or 1 drops the rows (axis=0) or the columns (axis=1) that contain missing values
• how='all' drops only the rows (or columns) in which every value is missing; the default how='any' drops them as soon as one value is missing
• inplace=True modifies the same dataframe instead of returning a new one
• thresh=integer value keeps only the rows (or columns) that have at least thresh non-missing
values.
Test the function dropna() on your dataframe as:
• dropna(how='all', inplace=True) and check if missing values are removed!
• dropna(inplace=True, axis=0) and check if missing values are removed!
What do you conclude?


In [161]:
Custemers_Data.dropna(how='all', inplace=True)
# Check if any rows with all missing values were removed
print(Custemers_Data.isnull().sum())

name                    81
address               3365
city                  3374
state                 3365
zip                   3362
phone                 3366
email                 3363
work                  3390
work address          3383
work city             3378
work state            3390
work zipcode          3378
work phone            3377
work email            3359
account created on    3368
dtype: int64


In [162]:
Custemers_Data.dropna(inplace=True, axis=0)
# Check if rows with any missing values were removed
print(Custemers_Data.isnull().sum())

name                  0
address               0
city                  0
state                 0
zip                   0
phone                 0
email                 0
work                  0
work address          0
work city             0
work state            0
work zipcode          0
work phone            0
work email            0
account created on    0
dtype: int64


With dropna(how='all'), only rows made up entirely of NaN values are removed. The counts are unchanged here, so the dataset contains no fully empty row.
With dropna(axis=0), the default how='any' applies and every row containing at least one missing value is removed. No missing value remains, but with missing values scattered throughout the DataFrame this can discard a large share of the rows.
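
The effect of each dropna() argument can be sketched on a toy DataFrame. The column names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy DataFrame for illustration only
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, np.nan, np.nan, np.nan],
    'c': [1.0, 2.0, np.nan, 4.0],
})

rows_any = df.dropna()              # how='any' is the default: keeps only complete rows
rows_all = df.dropna(how='all')     # drops only the fully empty row (label 2)
rows_thresh = df.dropna(thresh=2)   # keeps rows with at least 2 non-missing values
cols_any = df.dropna(axis=1)        # drops every column that has a missing value

print(list(rows_any.index), list(rows_all.index), list(rows_thresh.index), cols_any.shape)
```

None of these calls uses inplace=True, so df itself is left unchanged and the four results can be compared side by side.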

## Exercise 5: Ensure that values have the right format/type
If we take the 'name' column, we find the first name and the last name separated by a blank character.
1. Check the type of each column using the dtypes attribute


In [163]:
# Check the data type of each column
print(Custemers_Data.dtypes)

name                  object
address               object
city                  object
state                 object
zip                   object
phone                 object
email                 object
work                  object
work address          object
work city             object
work state            object
work zipcode          object
work phone            object
work email            object
account created on    object
dtype: object


2. Let us split the column 'name' into two columns where the first string is the first name and the second
string is the last name (see str.split()).


In [164]:
# Split 'name' column into 'first_name' and 'last_name' columns
# (n must be passed as a keyword; passing it positionally raises a TypeError in recent pandas versions)
Custemers_Data[['first_name', 'last_name']] = Custemers_Data['name'].str.split(' ', n=1, expand=True)
print(Custemers_Data[['first_name', 'last_name']].head())

3. You may notice incorrect last_name values. Remove the rows whose last name is longer than 16 characters.
Take 5 minutes to outline the main ideas you've retained from these exercises (Part I and II).

In [None]:
# Remove rows where 'last_name' has more than 16 characters
Custemers_Data = Custemers_Data[Custemers_Data['last_name'].str.len() <= 16]
print(Custemers_Data.shape)
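
The split and the length filter can be tried on a small sketch first. The names below are invented; the last one mimics a malformed record in which several fields ended up in the name.

```python
import pandas as pd

# Made-up names for illustration; the last one mimics a malformed record
people = pd.DataFrame({'name': ['Ada Lovelace', 'Jean Le Rond', 'Eddy Bogisich,33431 Dollie Squares']})

# n=1 splits on the first space only, so the remainder stays in a single column
people[['first_name', 'last_name']] = people['name'].str.split(' ', n=1, expand=True)

# Keep the rows whose last name has at most 16 characters
clean = people[people['last_name'].str.len() <= 16]
print(clean[['first_name', 'last_name']])
```

Limiting the split to one occurrence keeps compound last names such as 'Le Rond' intact, at the cost of treating any middle name as part of the last name.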


Here are the key ideas retained from these exercises:

Data Management with Pandas: You learned how to load data from a CSV file into a Pandas DataFrame, how to check its structure (using shape, head(), and columns), and how to query data effectively.
Handling Missing Data: You gained experience using Pandas functions like isnull() and dropna(), and filtering data based on missing values. You also explored different ways to handle missing data, like dropping rows or columns with missing values.
String Operations: You practiced string manipulations such as splitting strings within a column into multiple columns and cleaning up data by handling incorrect values (e.g., removing rows with long last_name values).
Vectorization vs Loops: You learned the benefits of vectorized operations, especially for performance, and how they compare to loops.
Data Integrity: Ensuring that the data types and formats of columns are correct is essential for any further analysis or preprocessing steps in the machine learning pipeline.

# Part III: Data repairing with imputation
## Exercise 6: Do it from scratch at home and upload it on Moodle
Now you have new datasets "Olympics.csv" and "flicker.csv" available on Moodle. You have to define your own
strategy to clean these data and comment your notebook at each step.


## Exercise 7: Imputation to handle missing data
In this exercise we will compare two data-cleaning solutions. The first one drops the columns that contain missing values, while the second fills the missing values with plausible ones. In your notebook, open the file melb_data.csv.
1. Load your data as a dataframe and give it a clear name such as "initialData"

In [None]:
import pandas as pd

# Load the dataset
initialData = pd.read_csv('data/melb_data.csv')


2. observe first the quality of the data by using these functions: dataframe.columns, dataframe.info(), and
dataframe.shape.


In [None]:
# Display the column names
print(initialData.columns)

# Display info about data types and non-null counts
# (info() prints directly and returns None, so it is not wrapped in print)
initialData.info()

# Display the shape (rows, columns) of the DataFrame
print(initialData.shape)


3. display the total missing values by columns (axis=0)


In [None]:
# Total missing values by columns
missing_by_columns = initialData.isnull().sum()
print(missing_by_columns)


4. display total missing Values by rows (axis=1)


In [None]:
# Total missing values by rows
missing_by_rows = initialData.isnull().sum(axis=1)
print(missing_by_rows)


5. List the columns with missing values and store them in a variable colsWithMissing. Use the function isnull()
and try to write a simple function, reusable with other datasets, with the signature missing_columns(originDB),
which also returns the data without those columns.
If those columns held relevant information, your model loses access to it when the columns are dropped. Another drawback of this solution is that forgetting to drop the same columns from the test dataset will
cause an error.

In [None]:
def missing_columns(originDB):
    # List of columns with missing values
    colsWithMissing = [col for col in originDB.columns if originDB[col].isnull().any()]
    
    # Drop columns with missing values
    reduced_original_data = originDB.drop(colsWithMissing, axis=1)
    
    return colsWithMissing, reduced_original_data

# Call the function on initialData
colsWithMissing, reduced_initialData = missing_columns(initialData)
print("Columns with missing values:", colsWithMissing)
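
The second drawback can be avoided by reusing the list computed on the training data, as in the sketch below. The train/test frames and their column names are hypothetical, and the function is a compact restatement of missing_columns so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def missing_columns(originDB):
    # Same logic as above: list the columns with missing values and drop them
    colsWithMissing = [col for col in originDB.columns if originDB[col].isnull().any()]
    return colsWithMissing, originDB.drop(colsWithMissing, axis=1)

# Hypothetical train/test split; column names are invented for the example
train = pd.DataFrame({'rooms': [2, 3], 'area': [80.0, np.nan], 'price': [1.0, 2.0]})
test = pd.DataFrame({'rooms': [4, 1], 'area': [95.0, 60.0], 'price': [3.0, 1.5]})

dropped, reduced_train = missing_columns(train)
# Reuse the list computed on the training data so both sets keep the same columns
reduced_test = test.drop(columns=dropped)

print(dropped, list(reduced_train.columns), list(reduced_test.columns))
```

Calling missing_columns(test) instead would drop nothing here, because the test set happens to have no missing value, and the two sets would end up with different columns.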


6. display the rate of missing values by columns


In [None]:
# Percentage of missing values by columns
missing_rate_columns = (initialData.isnull().sum() / len(initialData)) * 100
print(missing_rate_columns)


7. do the same on the rows


In [None]:
# Percentage of missing values by rows
missing_rate_rows = (initialData.isnull().sum(axis=1) / initialData.shape[1]) * 100
print(missing_rate_rows)


8. Remove the rows whose rate of missing values is > 5% from the original data and store the result in a new
dataframe variable named newData.


In [None]:
# Threshold for removing rows with more than 5% missing values
threshold = 0.05 * initialData.shape[1]

# Keep rows whose missing value rate is <= 5%; .copy() makes newData independent of initialData
newData = initialData[initialData.isnull().sum(axis=1) <= threshold].copy()
print("Shape of newData:", newData.shape)
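
An equivalent way to express the row filter is to compute the missing rate directly as the mean of the boolean mask. This is a sketch on a toy frame: the 25% threshold is illustrative (not the 5% of the exercise) and was chosen so that each missing cell in a 4-column row counts for exactly one step.

```python
import numpy as np
import pandas as pd

# Toy frame with 4 columns, so each missing cell is 25% of a row
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan],
    'c': [1.0, 2.0, 3.0],
    'd': [1.0, 2.0, 3.0],
})

# The mean of the boolean mask along axis=1 is the fraction of missing cells per row
row_rate = df.isnull().mean(axis=1)

max_rate = 0.25  # illustrative threshold, not the 5% of the exercise
kept = df[row_rate <= max_rate].copy()  # .copy() makes kept independent of df

print(row_rate.tolist(), list(kept.index))
```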


9. Call your function missing_columns(originDB) on both the original data and the new data obtained after
removing rows. How do you explain the difference in columns?


In [None]:
# Call on initialData
colsWithMissing_initial, reduced_initialData = missing_columns(initialData)

# Call on newData after removing rows
colsWithMissing_new, reduced_newData = missing_columns(newData)

print("Columns with missing values in initial data:", colsWithMissing_initial)
print("Columns with missing values in new data:", colsWithMissing_new)


10. Fill the missing values with the mean price. To do so: 1) display the statistical description of the
column Price using the function describe(), 2) calculate the mean value, and 3) fill the missing
values with the mean:

In [None]:
# Statistical description of the 'Price' column
print(newData['Price'].describe())


In [None]:
# Calculate the mean of the 'Price' column
price_mean = newData['Price'].mean()
print("Mean Price:", price_mean)


In [None]:
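# Fill missing values in 'Price' with the mean value
# (building a new DataFrame avoids the chained inplace call, which recent pandas versions warn about)
newData = newData.assign(Price=newData['Price'].fillna(price_mean))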
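
The three steps of mean imputation can be checked on a toy column. The values are made up; only the column name 'Price' mirrors the exercise.

```python
import numpy as np
import pandas as pd

# Toy data; 'Price' mirrors the column name used in the exercise
df = pd.DataFrame({'Price': [100.0, np.nan, 300.0, np.nan]})

# mean() ignores NaN values, so this is (100 + 300) / 2
price_mean = df['Price'].mean()

# assign() returns a new DataFrame and leaves df untouched
filled = df.assign(Price=df['Price'].fillna(price_mean))

print(price_mean)
print(filled['Price'].tolist())
```

Mean imputation keeps every row, but it also shrinks the variance of the column, so it is worth comparing describe() before and after filling.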
