# Data Cleaning Notebook

## Objectives

*   Check Column Naming
*   Check for NaN values in the DataFrame
*   Data Cleaning
*   Data Exploration

## Inputs

* extracted_features.csv

## Outputs

* corrected_extracted_features.csv
* x_test.csv
* x_train.csv
* y_test.csv
* y_train.csv

## Conclusions

 
  * 
  * 



---


# Change working directory

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [14]:
import os
current_dir = os.getcwd()
current_dir

'/home/jaaz/Desktop/project-5/TailTeller'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("New current directory:", os.getcwd())

New current directory: /home/jaaz/Desktop/project-5/TailTeller


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/home/jaaz/Desktop/project-5/TailTeller'
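One detail worth keeping in mind with `os.path.dirname()`: if the path ends with a separator, it returns the same folder rather than its parent. The sketch below is only an illustration; `parent_of` is a hypothetical helper and the example paths are made up. It normalizes the path first so the parent is found either way, and it does not change the working directory.

```python
import os


def parent_of(path: str) -> str:
    """Return the parent folder of `path`, ignoring any trailing separator."""
    # normpath strips a trailing separator, so dirname always moves up one level
    return os.path.dirname(os.path.normpath(path))


# Hypothetical example paths, not taken from the project
plain = "/home/user/project/notebooks"
trailing = "/home/user/project/notebooks/"

print(os.path.dirname(trailing))  # dirname alone does not move up here
print(parent_of(plain))
print(parent_of(trailing))
```

Because `os.chdir(os.path.dirname(os.getcwd()))` moves up one level every time it runs, re-running that cell moves the notebook further up the tree; checking `os.getcwd()` afterwards, as done above, is a cheap safeguard.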

---

# Check Column Naming

Load `extracted_features.csv`.

In [5]:
import pandas as pd
df_raw_path = "extracted_features.csv"
features_df = pd.read_csv(df_raw_path)
features_df.head(10)

Unnamed: 0,4095,breed,0,1,2,3,4,5,6,7,...,4085,4086,4087,4088,4089,4090,4091,4092,4093,4094
0,0.0,boston_bull,7.031771,11.45144,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,dingo,0.0,0.0,0.0,0.0,0.0,5.701201,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.935082,0.0,2.509676
2,0.0,pekinese,0.0,0.0,0.0,1.967434,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.950961
3,0.0,bluetick,0.0,0.0,0.0,0.0,16.433384,10.652593,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,golden_retriever,5.882503,9.397577,0.0,0.0,0.0,7.932431,0.0,0.0,...,0.0,0.0,0.0,0.0,3.478842,0.0,0.0,5.857906,2.223822,0.0
5,0.0,bedlington_terrier,0.0,0.0,0.0,8.305769,0.0,0.86016,0.0,0.0,...,0.990074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,bedlington_terrier,0.0,0.0,0.0,20.118141,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,borzoi,0.0,0.0,0.0,10.339137,0.0,10.910535,0.0,2.537553,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.214506
8,0.0,basenji,22.785442,0.0,0.0,0.0,0.0,10.543952,1.165346,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.53432,0.0,0.0
9,0.0,scottish_deerhound,0.0,0.0,0.0,0.0,0.0,6.732529,0.0,0.0,...,4.372693,0.0,0.0,0.0,6.906102,0.0,10.427298,0.0,0.0,0.0


The `breed` column is not the first column.

For easier reference, let's put `breed` first (here, by making it the DataFrame index), followed by the feature columns.

* Ensure that the breed column is placed first.
* Correctly label and order the feature columns.

In [6]:
import pandas as pd
print(features_df.head())
print(features_df.columns)


   4095             breed         0          1    2         3          4  \
0   0.0       boston_bull  7.031771  11.451440  0.0  0.000000   0.000000   
1   0.0             dingo  0.000000   0.000000  0.0  0.000000   0.000000   
2   0.0          pekinese  0.000000   0.000000  0.0  1.967434   0.000000   
3   0.0          bluetick  0.000000   0.000000  0.0  0.000000  16.433384   
4   0.0  golden_retriever  5.882503   9.397577  0.0  0.000000   0.000000   

           5    6    7  ...  4085  4086  4087  4088      4089  4090  4091  \
0   0.000000  0.0  0.0  ...   0.0   0.0   0.0   0.0  0.000000   0.0   0.0   
1   5.701201  0.0  0.0  ...   0.0   0.0   0.0   0.0  0.000000   0.0   0.0   
2   0.000000  0.0  0.0  ...   0.0   0.0   0.0   0.0  0.000000   0.0   0.0   
3  10.652593  0.0  0.0  ...   0.0   0.0   0.0   0.0  0.000000   0.0   0.0   
4   7.932431  0.0  0.0  ...   0.0   0.0   0.0   0.0  3.478842   0.0   0.0   

       4092      4093      4094  
0  0.000000  0.000000  0.000000  
1  3.935082 

In [7]:
# Move the breed column to be the index
features_df.set_index('breed', inplace=True)

print(features_df.head())
print(features_df.columns)

                  4095         0          1    2         3          4  \
breed                                                                   
boston_bull        0.0  7.031771  11.451440  0.0  0.000000   0.000000   
dingo              0.0  0.000000   0.000000  0.0  0.000000   0.000000   
pekinese           0.0  0.000000   0.000000  0.0  1.967434   0.000000   
bluetick           0.0  0.000000   0.000000  0.0  0.000000  16.433384   
golden_retriever   0.0  5.882503   9.397577  0.0  0.000000   0.000000   

                          5    6    7          8  ...  4085  4086  4087  4088  \
breed                                             ...                           
boston_bull        0.000000  0.0  0.0   0.000000  ...   0.0   0.0   0.0   0.0   
dingo              5.701201  0.0  0.0   0.000000  ...   0.0   0.0   0.0   0.0   
pekinese           0.000000  0.0  0.0  16.114876  ...   0.0   0.0   0.0   0.0   
bluetick          10.652593  0.0  0.0   0.000000  ...   0.0   0.0   0.0   0.0   
go
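There are two common ways to put `breed` "first": keep it as a regular column and move it to the front, or make it the index as done above. The sketch below shows both on a tiny made-up DataFrame (`toy` and its values are illustrative, not the real data).

```python
import pandas as pd

# Tiny stand-in for features_df; the values are invented for illustration
toy = pd.DataFrame(
    {
        "4095": [0.0, 0.0],
        "breed": ["dingo", "pekinese"],
        "0": [1.5, 0.0],
        "1": [0.0, 2.5],
    }
)

# Option A: keep breed as an ordinary column, moved to the front
breed_first = toy[["breed"] + [c for c in toy.columns if c != "breed"]]

# Option B: use breed as the index (what this notebook does)
breed_index = toy.set_index("breed")

print(list(breed_first.columns))
print(list(breed_index.columns), breed_index.index.name)
```

With Option B the DataFrame holds only numeric feature columns, which is convenient when passing it straight to scikit-learn later on.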

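Rename the columns of the DataFrame to more descriptive names, `Feature_0` to `Feature_4095`. Note that the rename below is positional: after `set_index`, the first remaining column is the original `4095`, so `Feature_0` holds the original column `4095`, `Feature_1` holds the original column `0`, and so on.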

In [8]:
features_df.columns = ['Feature_' + str(i) for i in range(len(features_df.columns))]

print(features_df.head())
print(features_df.columns)

                  Feature_0  Feature_1  Feature_2  Feature_3  Feature_4  \
breed                                                                     
boston_bull             0.0   7.031771  11.451440        0.0   0.000000   
dingo                   0.0   0.000000   0.000000        0.0   0.000000   
pekinese                0.0   0.000000   0.000000        0.0   1.967434   
bluetick                0.0   0.000000   0.000000        0.0   0.000000   
golden_retriever        0.0   5.882503   9.397577        0.0   0.000000   

                  Feature_5  Feature_6  Feature_7  Feature_8  Feature_9  ...  \
breed                                                                    ...   
boston_bull        0.000000   0.000000        0.0        0.0   0.000000  ...   
dingo              0.000000   5.701201        0.0        0.0   0.000000  ...   
pekinese           0.000000   0.000000        0.0        0.0  16.114876  ...   
bluetick          16.433384  10.652593        0.0        0.0   0.000000  .
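If the new names should keep the original column numbers instead of being assigned by position, one possible alternative is to prefix each existing name and then sort numerically. This is a sketch on made-up data, not what the notebook ran; `toy` is a hypothetical stand-in for `features_df`.

```python
import pandas as pd

# Stand-in with the same column order problem: "4095" comes before "0" and "1"
toy = pd.DataFrame(
    {"4095": [9.0, 8.0], "0": [1.5, 0.0], "1": [0.0, 2.5]},
    index=pd.Index(["dingo", "pekinese"], name="breed"),
)

# Prefix each existing name, so Feature_k always refers to original column k
renamed = toy.rename(columns=lambda c: f"Feature_{c}")

# Order the columns by their numeric suffix
ordered = renamed[sorted(renamed.columns, key=lambda c: int(c.split("_")[1]))]

print(list(ordered.columns))
```

Either naming scheme works for modelling as long as it is applied consistently; the difference only matters when a feature needs to be traced back to its original column.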

# Check for any NaN values in the DataFrame

In [9]:
null_counts_per_column = features_df.isnull().sum()
print(null_counts_per_column)

# Check for the total number of null values in the DataFrame
total_null_values = features_df.isnull().sum().sum()
print(f"Total number of null values in the DataFrame: {total_null_values}")



Feature_0       0
Feature_1       0
Feature_2       0
Feature_3       0
Feature_4       0
               ..
Feature_4091    0
Feature_4092    0
Feature_4093    0
Feature_4094    0
Feature_4095    0
Length: 4096, dtype: int64
Total number of null values in the DataFrame: 0
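Since the real data has no missing values, the check above prints only zeros. To show what the same calls report when values are missing, here is a small sketch on an invented DataFrame (`toy` and `cols_with_nulls` are illustrative names).

```python
import numpy as np
import pandas as pd

# Invented data with two missing values
toy = pd.DataFrame(
    {
        "Feature_0": [0.0, np.nan, 1.2],
        "Feature_1": [3.4, 0.0, np.nan],
        "Feature_2": [1.0, 2.0, 3.0],
    }
)

per_column = toy.isnull().sum()              # missing values per column
total = int(per_column.sum())                # missing values overall
cols_with_nulls = per_column[per_column > 0]  # only the affected columns

print(cols_with_nulls)
print(f"Total number of null values: {total}")
```

With 4,096 columns, filtering to the affected columns is easier to read than the full per-column listing.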


# Save the Corrected DataFrame

In [10]:
features_df.to_csv('corrected_extracted_features.csv')

# Data Exploration

With 120 different breeds and 4,096 features per row, the data is hard to visualize directly. We reduce the complexity by projecting the features onto two principal components with PCA and plotting the result. Sampling a manageable number of breeds, chosen by criteria such as frequency or diversity, would give an even clearer picture; the cell below plots all breeds.

In [12]:
from sklearn.decomposition import PCA
import pandas as pd
import plotly.express as px

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(features_df)

# Convert the principal components into a DataFrame for easier plotting
pca_df = pd.DataFrame(data = principal_components, columns = ['Component 1', 'Component 2'])
pca_df['Breed'] = features_df.index

fig = px.scatter(pca_df, x='Component 1', y='Component 2', color='Breed',
                 hover_data=['Breed'], title='PCA Visualization of Dog Breeds')
fig.show()


# Explanation

* PCA Transformation: PCA reduces the 4,096 feature columns to two components so the data can be plotted in 2D.

* Plotly Plot: The DataFrame created from the PCA output is passed directly to `px.scatter()`, which produces an interactive plot and uses the column names for the axes and the color encoding.

* Color Encoding: Points are colored by the `Breed` column, which is filled from `features_df.index`.

This approach gives a clean, interactive PCA plot of our dog breed data. Hovering over a point shows additional data such as the breed.
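The sampling idea mentioned at the start of this section could look roughly like the sketch below: keep only the most frequent breeds, run PCA on that subset, and check how much variance the two components retain. The data here is synthetic, and `toy_df`, `top_n`, and the breed counts are assumptions made for illustration; plotting is left out so the sketch only needs pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for features_df: breed labels as index, 8 feature columns
breeds = ["dingo"] * 25 + ["borzoi"] * 20 + ["basenji"] * 10 + ["pekinese"] * 5
toy_df = pd.DataFrame(
    rng.random((len(breeds), 8)),
    index=pd.Index(breeds, name="breed"),
    columns=[f"Feature_{i}" for i in range(8)],
)

# Keep only the top_n most frequent breeds
top_n = 2
top_breeds = toy_df.index.value_counts().nlargest(top_n).index
subset = toy_df[toy_df.index.isin(top_breeds)]

# Project the subset onto two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(subset)
pca_df = pd.DataFrame(components, columns=["Component 1", "Component 2"])
pca_df["Breed"] = subset.index.to_numpy()

print(sorted(top_breeds))
print(pca_df.shape)
print("Variance retained:", pca.explained_variance_ratio_.sum())
```

`explained_variance_ratio_` is a quick sanity check: if the two components retain only a small share of the variance, clusters in the 2D plot should be read with caution.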

## Split Train and Test Set

We will use a stratified split. This method is particularly useful when classes may be imbalanced, which is common in dog breed identification, where some breeds appear more often than others. Stratified splitting ensures that each class is represented in both the training and test sets in proportion to its share of the original data.

* Resetting the Index: `reset_index(drop=True)` discards the breed index, so the resulting DataFrame contains only feature data.

* Capturing the Labels: We take the breed labels from the original index, so they stay in the same row order as the features.

* Stratified Split: Both the training and testing datasets get the same percentage of samples of each breed as the original dataset. This is very useful for classification problems.

In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Reset index to remove breeds from index
features = features_df.reset_index(drop=True)
# Store the labels
labels = features_df.index

# Perform stratified split
x_train, x_test, y_train, y_test = train_test_split(
    features,     
    labels,       
    test_size=0.2,  
    stratify=labels,
    random_state=42
)

# Print shapes of the train and test sets
print("Training set shape:", x_train.shape, y_train.shape)
print("Testing set shape:", x_test.shape, y_test.shape)

Training set shape: (8177, 4096) (8177,)
Testing set shape: (2045, 4096) (2045,)
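To see what `stratify` does, the following sketch splits a small, deliberately imbalanced toy dataset and compares class proportions before and after. The breed names and the 80/20 class balance are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80% of one breed, 20% of another
labels = pd.Series(["dingo"] * 80 + ["borzoi"] * 20, name="Breed")
features = pd.DataFrame({"Feature_0": np.arange(100, dtype=float)})

x_train, x_test, y_train, y_test = train_test_split(
    features,
    labels,
    test_size=0.2,
    stratify=labels,   # keep the class proportions in both splits
    random_state=42,
)

print("Original:", labels.value_counts(normalize=True).to_dict())
print("Train:   ", y_train.value_counts(normalize=True).to_dict())
print("Test:    ", y_test.value_counts(normalize=True).to_dict())
```

Note that stratification needs at least two samples of every class; `train_test_split` raises a `ValueError` if a breed appears only once.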


## Train Set

We now save the train and test sets to CSV. Note that `y_train` and `y_test` are not DataFrames: they hold the breed labels taken from the DataFrame index. We convert each of them to a named pandas Series first, so the CSV gets a `Breed` header.

In [20]:
import os
import pandas as pd

# Make sure the output folder exists before writing to it
os.makedirs('outputs', exist_ok=True)

x_train.to_csv('outputs/x_train.csv', index=False)

# Convert the y_train labels to a named Series
y_train_series = pd.Series(y_train, name='Breed')
# Now save to CSV
y_train_series.to_csv('outputs/y_train.csv', index=False)

## Test Set

In [21]:
import pandas as pd

x_test.to_csv('outputs/x_test.csv', index=False)

# Convert the y_test labels to a named Series
y_test_series = pd.Series(y_test, name='Breed')
# Now save to CSV
y_test_series.to_csv('outputs/y_test.csv', index=False)
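A quick way to confirm the label files were written as intended is to read one back and compare it with the in-memory labels. The sketch below does this in a temporary folder with invented labels; the `y_train` values and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Invented labels standing in for y_train (an Index named "breed")
y_train = pd.Index(["dingo", "borzoi", "dingo"], name="breed")

# Write into a temporary "outputs" folder so nothing real is overwritten
out_dir = os.path.join(tempfile.mkdtemp(), "outputs")
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, "y_train.csv")

# Same conversion as above: a named Series, saved without the row index
pd.Series(y_train, name="Breed").to_csv(path, index=False)

# Read the file back and compare with the original labels
reloaded = pd.read_csv(path)["Breed"]
print(reloaded.tolist() == list(y_train))
```

Because both the features and the labels are saved with `index=False`, they are matched purely by row order when loaded again, so the files should always be read and used as a pair.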