## ⚛️ Data Science - Training Activities ⚙️ (Activity N° 1)

#### *By: Jiliar Silgado Cardona*

✅ [***LinkedIn***](https://www.linkedin.com/in/jiliar-silgado-cardona-4b970b286/)

✅ [***GitHub***](https://github.com/Jiliar)

### 💡Introduction to the Activity Report: Functions, Variable Types, and Data Cleaning

This report presents the results of two key activities focused on data analysis and manipulation. Both activities involved using Jupyter Notebook to work with a given dataset, applying specific functions to explore and clean the variables it contains.

The activities below rely on Python functions for reading and processing the data.

#### General Resources for Solutions:

##### Functions

In [1]:
# Pandas Importing
import pandas as pd 

def get_file_details(file, index):

    # Reading the CSV file into a DataFrame, with the specified column set as the index.
    df = pd.read_csv(file, index_col = index)  

    # Stripping leading and trailing spaces from the column headers.
    df.columns = df.columns.str.strip()

    # Returning the cleaned DataFrame.
    return df 
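
As a quick illustration of what the helper does, the sketch below feeds it a small in-memory CSV instead of a file on disk. The sample data and the names `sample_csv` and `sample_df` are assumptions made for this example only; the helper itself is repeated so the snippet runs on its own.

```python
import io

import pandas as pd


def get_file_details(file, index):
    # Same helper as above, repeated so this sketch is self-contained.
    df = pd.read_csv(file, index_col=index)
    df.columns = df.columns.str.strip()
    return df


# Hypothetical CSV whose headers carry stray leading/trailing spaces.
sample_csv = io.StringIO("id, age ,  city\n1,39,Bogota\n2,50,Lima\n")

# Column 0 ("id") becomes the index; the remaining headers are stripped.
sample_df = get_file_details(sample_csv, 0)
print(list(sample_df.columns))
```

Note that only the remaining column headers are stripped; the name of the column chosen as the index is left as read.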

### Activity 1: Functions and Variable Types

#### ✍️ Statement

In a Jupyter notebook exported to HTML, write code blocks and Markdown cells that describe the variables of a given dataset as follows:

+ The type of each variable.
+ Apply the corresponding functions based on the type of each variable to determine:
    + The total number of values.
    + The distinct values.
    + The null values.

#### 💡Introduction

In the first activity, the variables in the dataset were identified and classified according to their type (numerical, categorical, etc.). For each type of variable, specific functions were applied to determine the total number of values, distinct values, and the presence of null values. The output of this activity is an HTML file that includes both the code used and Markdown descriptions of the results obtained.
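
The four measures described above (type, total values, distinct values, null values) can also be gathered into a single table. The following is a minimal sketch on a made-up three-row frame; the frame `toy` and the table name `summary` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the census data; values are illustrative only.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", None],
})

# One row per variable, one column per measure.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),   # type of each variable
    "non_null": toy.count(),           # number of non-null values
    "distinct": toy.nunique(),         # distinct values (nulls excluded)
    "nulls": toy.isnull().sum(),       # number of null values
})
print(summary)
```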

#### 📥 Input:

***Download Page:*** [Census Income](https://archive.ics.uci.edu/dataset/20/census+income)

#### Data:

In [2]:
input1 = "../../data/csv/census_income-1.csv" # Specify the path to the CSV file.

# Call the get_file_details function, passing in the file path and setting the first column (index 0) as the index.
# Use the .head() method to display the first 5 rows of the DataFrame.
get_file_details(input1, 0).head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### ⚙️ Data Processing:

### 📝 1. Variable Types:

In [3]:
# Load the CSV file with the last column (index -1, 'income') set as the index, so it is not listed among the columns below.
df = get_file_details(input1, -1)

# Convert all columns of dtype 'object' to 'string', because every object column in this file holds text.
df = df.astype({col: 'string' for col in df.select_dtypes(include='object').columns})

# Print the data types of each column in the DataFrame to verify the conversion.
print(df.dtypes)

Unnamed: 0                 int64
age                        int64
workclass         string[python]
fnlwgt                     int64
education         string[python]
education-num              int64
marital-status    string[python]
occupation        string[python]
relationship      string[python]
race              string[python]
sex               string[python]
capital-gain               int64
capital-loss               int64
hours-per-week             int64
native-country    string[python]
dtype: object
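
Once the types are known, the columns can be split into numerical and categorical groups, which is the classification the activity asks for. This is a rough sketch on a made-up frame; `toy`, `numeric_cols` and `categorical_cols` are hypothetical names, and treating every non-numeric column as categorical is an assumption that fits this dataset but not necessarily others.

```python
import pandas as pd

# Hypothetical miniature of the census data; values are illustrative only.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", None],
    "hours-per-week": [40, 13, 40],
})

# Numeric dtypes (int64, float64, ...) versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```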


### 📝 2. Number of values

##### ✏️ 2.1. Number of non-null values for each variable:

In [4]:
# Calculate the total number of non-null values in each column of the DataFrame.
total_values_per_column = df.count()

# Print the result, which shows the count of non-null values for each column.
print(total_values_per_column)

Unnamed: 0        48842
age               48842
workclass         47879
fnlwgt            48842
education         48842
education-num     48842
marital-status    48842
occupation        47876
relationship      48842
race              48842
sex               48842
capital-gain      48842
capital-loss      48842
hours-per-week    48842
native-country    48568
dtype: int64


##### ✏️ 2.2. Total number of non-null values:

In [5]:
# Calculate the total number of non-null values across all columns in the DataFrame.
# Print the result, which shows the total count of non-null values in the entire DataFrame.

print(df.count().sum())

730427


### 📝 3. Distinct values

##### ✏️ 3.1. Total distinct values (sum over all variables):

In [6]:
# Calculate and print the total number of unique values across all columns in the DataFrame.

print(df.nunique().sum())

77875


##### ✏️ 3.2. Distinct values for each variable:

In [7]:
# Calculate and print the number of unique values for each column in the DataFrame.

print(df.nunique())

Unnamed: 0        48842
age                  74
workclass             9
fnlwgt            28523
education            16
education-num        16
marital-status        7
occupation           15
relationship          6
race                  5
sex                   2
capital-gain        123
capital-loss         99
hours-per-week       96
native-country       42
dtype: int64
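
When reading these counts, it helps to know that `nunique()` leaves null values out by default; passing `dropna=False` counts the null as one more distinct value. A small sketch on a made-up series (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical column with one missing entry and one repeated value.
workclass = pd.Series(["Private", "State-gov", None, "Private"])

# Default behaviour: the null is not counted as a distinct value.
distinct_without_null = workclass.nunique()

# With dropna=False the null counts as one additional distinct value.
distinct_with_null = workclass.nunique(dropna=False)

print(distinct_without_null, distinct_with_null)
```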


### 📝 4. Null values

##### ✏️ 4.1. Null values for each variable:

In [8]:
# Calculate and print the number of missing (NaN) values for each column in the DataFrame.

print(df.isnull().sum())

Unnamed: 0          0
age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
dtype: int64
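
The raw counts can be put in proportion to the 48842 rows of the dataset. The sketch below reuses the three non-zero counts reported above; the variable names are assumptions for this example. On the full DataFrame, `df.isnull().mean() * 100` would give the same percentages directly.

```python
import pandas as pd

# Null counts copied from the output above (only the non-zero ones).
null_counts = pd.Series({
    "workclass": 963,
    "occupation": 966,
    "native-country": 274,
})
total_rows = 48842  # rows in the census file, per the counts in section 2.1

# Share of missing values per column, as a percentage.
null_share = (null_counts / total_rows * 100).round(2)
print(null_share)
```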


##### ✏️ 4.2. Total null values are:

In [9]:
# Calculate and print the total number of missing (NaN) values across the entire DataFrame.

print(df.isnull().sum().sum())

2203
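
The totals reported in this activity can be cross-checked against each other: non-null values plus null values should equal the number of cells (rows times columns). The sketch below does that arithmetic with the figures above and then shows the same identity on a small made-up frame; `toy` and the other names are assumptions for illustration.

```python
import pandas as pd

# Figures reported above for the census data.
rows, columns = 48842, 15
total_non_null = 730427
total_null = 2203

# Every cell is either null or non-null, so the two totals must add up.
cells = rows * columns
print(cells, total_non_null + total_null)

# The same identity on a hypothetical toy frame.
toy = pd.DataFrame({"age": [39, 50], "workclass": ["Private", None]})
toy_matches = toy.size == toy.count().sum() + toy.isnull().sum().sum()
print(toy_matches)
```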


### ✅ Activity 2: IF Statements for Data Cleaning

#### ✍️ Statement:

In a Jupyter notebook, write code blocks and Markdown cells that describe the variables of a given dataset as follows:

+ Initial variables.
+ Apply the corresponding functions to clean these variables.

#### 💡Introduction

The second activity focused on cleaning the variables previously explored. Using code blocks with conditional structures (IF statements), various functions were applied to improve data quality by removing additional characters and handling low-quality variables. As with the first activity, the final results were documented and submitted in an HTML file.

#### 📥 Input:

***Download Page:*** [A Titanic Probability](https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html)

In [10]:
# Specify the path to the CSV file.
input2 = "../../data/csv/titanic-1.csv"

# Call the get_file_details function, passing in the file path and setting the first column (index 0) as the index.
# Use the .head() method to display the first 5 rows of the DataFrame.
get_file_details(input2, 0).head()

Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


### ⚙️ Data Processing:

### 1. Initial variables.

In [11]:
# Load the CSV file with the last column (index -1, 'Fare') set as the index, so it is not listed among the columns below.
df = get_file_details(input2, -1)

# Print the data types of each column in the DataFrame to verify the data types of the loaded data.
print(df.dtypes)

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
dtype: object


### 2. Apply the corresponding functions to clean these variables.

#### ✏️ 2.1. Convert columns of type object to string

In [12]:
# Reload the CSV file with the last column (index -1, 'Fare') set as the index.
df = get_file_details(input2, -1)

# Convert all columns with data type 'object' to 'string' type.
df = df.astype({col: 'string' for col in df.select_dtypes(include='object').columns})

# Print the data types of each column in the DataFrame to verify the conversion.
print(df.dtypes)

Survived                            int64
Pclass                              int64
Name                       string[python]
Sex                        string[python]
Age                               float64
Siblings/Spouses Aboard             int64
Parents/Children Aboard             int64
dtype: object
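
To see what the conversion changes, the sketch below applies the same `astype` mapping to a made-up two-row frame. The frame `toy` is an assumption for illustration, and its `Name` column is forced to `object` so the example does not depend on how a given pandas version infers text columns.

```python
import pandas as pd

# Hypothetical two-row frame with an explicit object column.
toy = pd.DataFrame({
    "Name": pd.Series(["Mr. William Henry Allen", None], dtype="object"),
    "Age": [35.0, 22.0],
})

# Same pattern as above: map every object column to the 'string' dtype.
object_cols = toy.select_dtypes(include="object").columns
converted = toy.astype({col: "string" for col in object_cols})

print(converted.dtypes)
# Missing text is still reported as null after the conversion.
print(converted["Name"].isna().sum())
```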


#### ✏️ 2.2. Function to delete rows where all values are NaN

In [13]:
# Count the rows with len(); df.count() skips nulls, so it would never count an all-null row.
num_rows = len(df)
# Print the number of rows before removing any rows with all null values.
print('num_rows (before)', num_rows)

# Define a function to remove rows that contain all null values.
def remove_null_rows(df):
    # Check if there are any rows where all values are null.
    if df.isnull().all(axis=1).sum() > 0:
        # Drop rows where all values are null.
        df.dropna(how='all', inplace=True)
        # Print a message indicating that rows with all null values have been removed.
        print("Rows with all null values removed.")
    else:
        # Print a message if no rows with all null values are found.
        print("There are no rows to remove.")
    # Return the updated DataFrame.
    return df

# Call the remove_null_rows function to clean the DataFrame.
df = remove_null_rows(df)

# Count the rows again after cleanup.
num_rows = len(df)
# Print the number of rows after removing rows with all null values.
print('num_rows (after)', num_rows)


num_rows (before) 887
There are no rows to remove.
num_rows (after) 887
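
Because the Titanic file has no fully empty rows, the removal branch never runs above. The sketch below exercises it on a made-up frame; `toy` and `drop_all_null_rows` are hypothetical names, and unlike `remove_null_rows` this variant returns a new DataFrame instead of modifying its argument. Rows are counted with `len()`, since `count()` skips nulls and would not register an all-null row.

```python
import numpy as np
import pandas as pd


def drop_all_null_rows(frame):
    """Return a copy without rows whose every column is null (hypothetical helper)."""
    return frame.dropna(how="all")


# Hypothetical data: the second row is entirely null, the third only partly.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Sex": ["male", None, None],
})

cleaned = drop_all_null_rows(toy)
print("rows before:", len(toy))
print("rows after:", len(cleaned))
```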


#### ✏️ 2.3. Function to remove duplicate rows

In [14]:
# Calculate the maximum number of non-null values across all columns to determine the number of rows.
num_rows = df.count().max()
# Print the number of rows before removing any duplicate rows.
print('num_rows (before)', num_rows)

# Define a function to remove duplicate rows from the DataFrame.
def delete_duplicates(df):
    # Check if there are any duplicate rows in the DataFrame.
    if df.duplicated().any():
        # Drop duplicate rows, keeping only the first occurrence.
        df.drop_duplicates(inplace=True)
        # Print a message indicating that duplicate rows have been removed.
        print("Duplicate rows removed.")
    else:
        # Print a message if no duplicate rows are found.
        print("There are no duplicates to remove.")
    # Return the updated DataFrame.
    return df

# Call the delete_duplicates function to remove duplicate rows from the DataFrame.
df = delete_duplicates(df)

# Recalculate the maximum number of non-null values across all columns to determine the number of rows after cleanup.
num_rows = df.count().max()
# Print the number of rows after removing duplicate rows.
print('num_rows (after)', num_rows)

num_rows (before) 887
There are no duplicates to remove.
num_rows (after) 887
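
One detail worth keeping in mind: `duplicated()` compares column values only and ignores the index. Since the DataFrame in this activity was loaded with `Fare` as its index, two passengers who differ only in fare would be treated as duplicates. The sketch below shows the effect on made-up data; the frame `toy` and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame indexed by fare; the first two rows differ only in the index.
toy = pd.DataFrame(
    {"Name": ["Mr. A", "Mr. A", "Miss. B"], "Pclass": [3, 3, 1]},
    index=pd.Index([7.25, 8.05, 7.925], name="Fare"),
)

# Columns only: the second row counts as a duplicate of the first.
dupes_ignoring_index = int(toy.duplicated().sum())

# Moving the index back into the columns makes it part of the comparison.
dupes_with_index = int(toy.reset_index().duplicated().sum())

print(dupes_ignoring_index, dupes_with_index)
```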


#### ✏️ 2.4. Function to clean whitespace in text columns

In [15]:
# Define a function to remove leading and trailing white spaces from string columns in the DataFrame.
def clear_white_spaces(df):
    # Iterate over each column in the DataFrame that has data type 'string' or 'object'.
    for col in df.select_dtypes(include=['string', 'object']):
        # Check if the column contains any space character (this also matches spaces between words).
        if df[col].str.contains(' ').any():
            # Remove leading and trailing white spaces from each value in the column.
            df[col] = df[col].str.strip()
            # Print a message indicating that white spaces have been removed from the column.
            print(f"Blank spaces removed from column '{col}'.")
        else:
            # Print a message if no white spaces are found in the column.
            print(f"No Blank spaces in column '{col}'.")
    # Return the updated DataFrame with white spaces removed.
    return df

# Call the clear_white_spaces function to clean white spaces from string columns in the DataFrame.
df = clear_white_spaces(df)

Blank spaces removed from column 'Name'.
No Blank spaces in column 'Sex'.
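
The check in `clear_white_spaces` matches any space, so a column such as `Name` triggers the message even when its values only contain spaces between words. A stricter test looks for whitespace at the start or end of a value only. This is a sketch under that assumption; `has_outer_whitespace` and the sample names are hypothetical and not part of the original notebook.

```python
import pandas as pd


def has_outer_whitespace(series):
    """True if any value starts or ends with whitespace (hypothetical helper)."""
    # ^\s matches leading whitespace, \s$ trailing whitespace; nulls count as no match.
    return bool(series.str.contains(r"^\s|\s$", regex=True, na=False).any())


# Hypothetical names: the second one is padded with spaces on both sides.
names = pd.Series(
    ["Mr. Owen Harris Braund", " Miss. Laina Heikkinen "], dtype="string"
)
clean_names = names.str.strip()

print(has_outer_whitespace(names))        # padded value present
print(has_outer_whitespace(clean_names))  # only inner spaces remain
```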


#### ✏️ 2.5. Function to replace negative values with Zero (0)

In [16]:
# Define a function to replace negative values in a specified column of the DataFrame with zero.
def replace_negative_values(df, column):
    # Check if there are any negative values in the specified column.
    if (df[column] < 0).any():
        # Replace negative values with zero in the specified column.
        df.loc[df[column] < 0, column] = 0
        # Print a message indicating that negative values have been replaced.
        print(f"Negative values in the column '{column}' replaced.")
    else:
        # Print a message if there are no negative values in the column.
        print(f"There are no values less than zero in column '{column}'")
    # Return the updated DataFrame.
    return df

# Call replace_negative_values on each numeric column, in the order shown in the output.
for column in ['Survived', 'Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']:
    df = replace_negative_values(df, column)

There are no values less than zero in column 'Survived'
There are no values less than zero in column 'Pclass'
There are no values less than zero in column 'Age'
There are no values less than zero in column 'Siblings/Spouses Aboard'
There are no values less than zero in column 'Parents/Children Aboard'
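
Since the Titanic data contains no negative values, the replacement branch is never exercised above. The sketch below shows the same effect on made-up data using `clip(lower=0)`, which is offered here as an alternative to the IF-based function rather than as the notebook's method; `toy`, `cols` and `clipped` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one negative value per column and one missing age.
toy = pd.DataFrame({
    "Age": [22.0, -1.0, np.nan],
    "Pclass": [3, -2, 1],
})

cols = ["Age", "Pclass"]

# Work on a copy so the original frame stays untouched.
clipped = toy.copy()
# clip(lower=0) raises every value below zero to zero and leaves nulls as they are.
clipped[cols] = clipped[cols].clip(lower=0)

print(clipped)
```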
