# Quest 2: Data Cleaning with Pandas

This notebook walks through the fundamental steps of data cleaning using the Pandas library. We will start with a messy, raw dataset and clean it by fixing capitalization, handling missing values, and correcting data types.

In [2]:
# Explanation:
# First, we need to import our main tool, the Pandas library.
# By convention, we give it the short nickname 'pd' to make our code easier to type.
import pandas as pd

# We'll create a sample messy dataset using a Python dictionary.
# This simulates real-world data with common problems.
data = {
    'title': ['Movie A', 'Movie B', 'Movie C', 'movie d', 'Movie E'],
    'genre': ['Action', 'Comedy', None, 'Action', 'Drama'],
    'rating': [7, 8, 5, 7, '9'],
    'watch_count': [1500, 2300, 1200, 1500, '3000']
}

# Now, we load that data into a Pandas DataFrame.
# A DataFrame is like a smart spreadsheet that we can manipulate with code.
df = pd.DataFrame(data)

# Finally, we display the first 5 rows of our DataFrame to see its initial, messy state.
df.head()

,title,genre,rating,watch_count
0,Movie A,Action,7,1500
1,Movie B,Comedy,8,2300
2,Movie C,,5,1200
3,movie d,Action,7,1500
4,Movie E,Drama,9,3000
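
The rendered table hides two of the problems: the text value `'9'` looks identical to the number `9`, and the missing genre is easy to overlook. As a minimal sketch (the variable names `raw`, `rating_types`, and `missing_per_column` are our own, not part of the original notebook), we can ask Pandas which Python types are actually stored in a column and how many values are missing:

```python
import pandas as pd

# Same sample data as in the cell above.
data = {
    'title': ['Movie A', 'Movie B', 'Movie C', 'movie d', 'Movie E'],
    'genre': ['Action', 'Comedy', None, 'Action', 'Drama'],
    'rating': [7, 8, 5, 7, '9'],
    'watch_count': [1500, 2300, 1200, 1500, '3000']
}
raw = pd.DataFrame(data)

# Count the Python types stored in the 'rating' column.
# A clean numeric column would contain a single type.
rating_types = raw['rating'].map(lambda v: type(v).__name__).value_counts()

# Count missing values per column.
missing_per_column = raw.isna().sum()

print(raw.dtypes)
print(rating_types)
print(missing_per_column)
```

A column that mixes `int` and `str` values is stored with the generic `object` dtype, which is the signal that a type conversion is needed later.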


In [3]:
# Explanation:
# The title 'movie d' is lowercase, which is inconsistent with the other rows.
# To fix this, we select the 'title' column (df['title']) and apply the .str.title() method.
# This method capitalizes the first letter of every word in each value and lowercases the rest.
# We then assign the cleaned column back to its original place in the DataFrame.
df['title'] = df['title'].str.title()

# Display the DataFrame again to confirm the fix.
df.head()

,title,genre,rating,watch_count
0,Movie A,Action,7,1500
1,Movie B,Comedy,8,2300
2,Movie C,,5,1200
3,Movie D,Action,7,1500
4,Movie E,Drama,9,3000
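
Title-casing works well for this sample, but it is worth knowing how it behaves on less tidy text. The sketch below uses made-up titles (they are not in the original dataset) to show that `.str.title()` lowercases everything after the first letter of each word, treats an apostrophe as a word boundary, and does not remove stray spaces, which is why we chain `.str.strip()` in front of it here:

```python
import pandas as pd

# Hypothetical titles chosen to show edge cases; not part of the notebook's data.
titles = pd.Series(['movie d', '  the GODFATHER ', 'WALL-E', "ocean's eleven"])

# Strip surrounding whitespace first, then title-case each value.
cleaned = titles.str.strip().str.title()

print(cleaned.tolist())
```

If acronyms or apostrophes matter in a real dataset, a custom rule may be a better fit than plain title-casing.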


In [4]:
# Explanation:
# The 'genre' for 'Movie C' is missing (Pandas displays it as 'None' or 'NaN').
# Many machine learning models can't handle missing values directly, so we must deal with them.
# We'll use the .fillna() method to replace any missing values in the 'genre' column
# with the placeholder text 'Unknown'.
df['genre'] = df['genre'].fillna('Unknown')

# Display the DataFrame to confirm the missing value has been filled.
df.head()

,title,genre,rating,watch_count
0,Movie A,Action,7,1500
1,Movie B,Comedy,8,2300
2,Movie C,Unknown,5,1200
3,Movie D,Action,7,1500
4,Movie E,Drama,9,3000
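
Before choosing a strategy, it helps to measure how much data is missing. The sketch below, which works on a standalone copy of the `genre` values, counts missing entries with `.isna().sum()` and compares filling with a placeholder against dropping the rows with `.dropna()`. Which option is appropriate depends on the analysis; filling keeps all five movies, while dropping would discard 'Movie C'.

```python
import pandas as pd

# The 'genre' values from the sample dataset.
genres = pd.Series(['Action', 'Comedy', None, 'Action', 'Drama'], name='genre')

# How many values are missing before cleaning?
missing_before = int(genres.isna().sum())

# Option 1: keep every row and fill the gap with a placeholder.
filled = genres.fillna('Unknown')
missing_after = int(filled.isna().sum())

# Option 2: drop rows with a missing genre instead.
dropped = genres.dropna()

print(missing_before, missing_after, len(dropped))
print(filled.value_counts())
```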


In [5]:
# Explanation:
# Some values, like the rating '9' and the watch count '3000', are stored as text instead of numbers.
# Because of them, Pandas stores both columns with the generic 'object' type, so numeric calculations can fail.
# We use the pd.to_numeric() function to convert the entire 'rating' and 'watch_count' columns
# into a proper numeric data type.
df['rating'] = pd.to_numeric(df['rating'])
df['watch_count'] = pd.to_numeric(df['watch_count'])

# Displaying the DataFrame won't show a visible change, but the type is now correct.
df.head()

,title,genre,rating,watch_count
0,Movie A,Action,7,1500
1,Movie B,Comedy,8,2300
2,Movie C,Unknown,5,1200
3,Movie D,Action,7,1500
4,Movie E,Drama,9,3000
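
The conversion above succeeds because every value in our sample can be parsed as a number. Real data often contains text that cannot be parsed, in which case `pd.to_numeric()` raises a `ValueError` by default. As a sketch, assuming a made-up bad value `'n/a'` that is not in the original dataset, passing `errors='coerce'` turns unparseable entries into `NaN` so we can find and handle them afterwards:

```python
import pandas as pd

# 'n/a' is a hypothetical bad entry added for illustration.
ratings = pd.Series([7, 8, 'n/a', 7, '9'])

# Default behaviour: an unparseable value raises an error.
try:
    pd.to_numeric(ratings)
    strict_failed = False
except ValueError:
    strict_failed = True

# With errors='coerce', unparseable values become NaN instead.
converted = pd.to_numeric(ratings, errors='coerce')

# List the original entries that could not be converted.
bad_rows = ratings[converted.isna()]

print(strict_failed)
print(converted)
print(bad_rows)
```

Note that a column containing `NaN` is stored as a floating-point type, so the result here is `float64` rather than `int64`.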


In [6]:
# Explanation:
# As a final check, we use the .info() method. This provides a technical summary
# of our DataFrame. It's a quick way to confirm our cleaning was successful.
# The output will show us:
#   1. 'Non-Null Count' for each column, confirming we have no more missing values.
#   2. 'Dtype' (Data Type) for each column, confirming that 'rating' and 'watch_count'
#      are now a numeric type (like 'int64').
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        5 non-null      object
 1   genre        5 non-null      object
 2   rating       5 non-null      int64 
 3   watch_count  5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes
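
Reading the `.info()` summary works for a small table, but the same two checks can also be expressed in code so they run automatically. The sketch below uses a hypothetical helper, `find_problems`, which is our own illustration and not a Pandas function. It reports columns that still contain missing values and columns that were expected to be numeric but are not.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_problems(df, numeric_columns):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # Check 1: any column that still contains missing values.
    for column, count in df.isna().sum().items():
        if count > 0:
            problems.append(f"{column}: {count} missing value(s)")
    # Check 2: any column that should be numeric but is not.
    for column in numeric_columns:
        if not is_numeric_dtype(df[column]):
            problems.append(f"{column}: dtype is {df[column].dtype}, expected numeric")
    return problems


# A small messy frame and its cleaned counterpart, for comparison.
messy = pd.DataFrame({
    'genre': ['Action', None, 'Drama'],
    'rating': [7, 5, '9'],
})
clean = pd.DataFrame({
    'genre': ['Action', 'Unknown', 'Drama'],
    'rating': [7, 5, 9],
})

print(find_problems(messy, ['rating']))
print(find_problems(clean, ['rating']))
```

The messy frame produces one message per problem, while the cleaned frame produces an empty list.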


## Conclusion & Next Steps

### Conclusion
In this notebook, we successfully performed fundamental data cleaning on a raw dataset. We addressed three common data quality issues:
* Inconsistent capitalization in the `title` column was standardized.
* Missing values (`None`) in the `genre` column were filled with a placeholder.
* Incorrect data types in the `rating` and `watch_count` columns were converted to a numeric format.

The resulting DataFrame has been validated with `.info()` and is now a clean, reliable dataset ready for the next stage of analysis.

### Next Steps
The next logical step for this project would be to use this clean dataset for **Exploratory Data Analysis (EDA)** to uncover insights, or to feed it into a **machine learning model** to make predictions.
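
As a small illustration of what a first EDA step might look like, the sketch below rebuilds the cleaned table from the final output above and summarizes it by genre. The choice of summary (movie count, mean rating, total watch count) and the output column names are our own assumptions, picked only to show that numeric columns now support calculations.

```python
import pandas as pd

# The cleaned dataset, as shown in the final output above.
clean = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'genre': ['Action', 'Comedy', 'Unknown', 'Action', 'Drama'],
    'rating': [7, 8, 5, 7, 9],
    'watch_count': [1500, 2300, 1200, 1500, 3000],
})

# Group by genre and compute a few summary statistics per group.
summary = clean.groupby('genre').agg(
    movies=('title', 'count'),
    mean_rating=('rating', 'mean'),
    total_watches=('watch_count', 'sum'),
)

print(summary)
```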