# Data Cleaning Script
This notebook cleans and preprocesses the dialogue datasets for three TV shows: Friends, How I Met Your Mother (HIMYM), and The Big Bang Theory (TBBT).
First step: import the Python libraries needed for data processing and cleaning.

In [1]:
import pandas as pd

# Data: Load, overview, convert
Load the datasets for Friends, HIMYM, and TBBT into pandas DataFrames.
Drop rows with missing values, then drop the unnecessary episode-title columns.
Standardize the air dates by removing periods from the date strings and parsing `original_air_date` as a pandas datetime, which makes later analysis easier.

In [2]:
friends = pd.read_csv('friends_combined.csv')
himym = pd.read_csv('HIMYM_combined.csv')
tbbt = pd.read_csv('TBBT_combined.csv')

friends = friends.dropna()
himym = himym.dropna()
tbbt = tbbt.dropna()

himym = himym.drop(columns=['episode'])
tbbt = tbbt.drop(columns=['episode_name'])
friends = friends.drop(columns=['episode_name'])

# regex=False removes literal periods; with regex=True (the default in pandas < 2.0) '.' matches every character
friends['original_air_date'] = pd.to_datetime(friends['original_air_date'].str.replace('.', '', regex=False), format="%d %b %Y")
himym['original_air_date'] = pd.to_datetime(himym['original_air_date'].str.replace('.', '', regex=False), format="%d %b %Y")
tbbt['original_air_date'] = pd.to_datetime(tbbt['original_air_date'].str.replace('.', '', regex=False), format="%d %b %Y")
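
The date conversion can be tried on a few values before running it on the full datasets. This is a minimal sketch: the sample strings are made up to match the `"%d %b %Y"` format used above, and the real CSVs may contain other variants. It also assumes English month abbreviations, since `%b` is locale-dependent.

```python
import pandas as pd

# Hypothetical sample values, not taken from the datasets.
raw = pd.Series(["22 Sep. 1994", "19 Sep 2005", "24 Sep. 2007"])

# regex=False treats '.' as a literal period rather than "any character".
without_periods = raw.str.replace(".", "", regex=False)

# Parse "day abbreviated-month year" strings into datetimes.
parsed = pd.to_datetime(without_periods, format="%d %b %Y")
print(parsed.dt.year.tolist())  # [1994, 2005, 2007]
```

A value that does not match the format raises an error by default. Passing `errors="coerce"` to `pd.to_datetime` turns such values into `NaT` instead, which can help locate malformed rows.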

## Filter and Clean Data
To keep the comparison across the series consistent, we limit the analysis to the first six seasons of each show.
We chose this cutoff because season 7 of HIMYM is incomplete, so restricting every show to six seasons keeps the comparison fair.
After filtering, print the number of distinct episodes per season for each show as a sanity check.
Finally, remove text inside parentheses from the dialogue/line columns and strip surrounding whitespace.

In [3]:
friends = friends[friends['season'] <= 6].copy()
himym = himym[himym['season'] <= 6].copy()
tbbt = tbbt[tbbt['season'] <= 6].copy()

print(friends.groupby('season')['episode_num'].nunique())
print(himym.groupby('season')['episode_num'].nunique())
print(tbbt.groupby('season')['episode_num'].nunique())


# Remove parenthesized text first, then strip, so whitespace left behind by the removal is trimmed too
friends['line'] = friends['line'].str.replace(r'\(.*?\)', '', regex=True).str.strip()
himym['line'] = himym['line'].str.replace(r'\(.*?\)', '', regex=True).str.strip()
tbbt['dialogue'] = tbbt['dialogue'].str.replace(r'\(.*?\)', '', regex=True).str.strip()

season
1    24
2    24
3    24
4    23
5    23
6    24
Name: episode_num, dtype: int64
season
1    22
2    20
3    16
4    24
5    24
6    24
Name: episode_num, dtype: int64
season
1    17
2    23
3    23
4    24
5    24
6    24
Name: episode_num, dtype: int64
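
The parenthetical removal can be checked on a few made-up lines. This is a sketch rather than part of the pipeline: the sample strings are hypothetical, and the extra step that collapses repeated whitespace is an optional addition that the cell above does not perform.

```python
import pandas as pd

# Hypothetical dialogue lines, not taken from the datasets.
lines = pd.Series([
    "  (laughing) Oh my God!  ",
    "We were on a break! (storms out)",
    "No parentheses here.",
])

cleaned = (
    lines.str.replace(r"\(.*?\)", "", regex=True)  # drop each "(...)" span (non-greedy)
         .str.replace(r"\s+", " ", regex=True)     # optional: collapse runs of whitespace
         .str.strip()                              # trim leading/trailing whitespace
)
print(cleaned.tolist())
# ['Oh my God!', 'We were on a break!', 'No parentheses here.']
```

The non-greedy `.*?` stops at the first closing parenthesis, so nested parentheses would leave a stray `)` behind. Whether the transcripts contain such cases would need to be checked against the data.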


## Add Word Count Column
Count the whitespace-separated words in each dialogue/line and store the result in a new `word_count` column.

In [4]:
friends['word_count'] = friends['line'].str.split().str.len()
himym['word_count'] = himym['line'].str.split().str.len()
tbbt['word_count'] = tbbt['dialogue'].str.split().str.len()
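
The word count relies on `str.split()` with no arguments, which splits on any run of whitespace. A small sketch with made-up strings shows how edge cases behave, including a line that became empty once its parenthesized text was removed:

```python
import pandas as pd

# Hypothetical cleaned lines; the empty string stands for a line
# that consisted only of parenthesized text before cleaning.
sample = pd.Series(["How you doin'?", "", "  Bazinga  "])

# split() without arguments splits on runs of whitespace and ignores
# leading/trailing spaces; len() then counts the resulting tokens.
word_count = sample.str.split().str.len()
print(word_count.tolist())  # [3, 0, 1]
```

Rows with a count of 0 carry no spoken words. Whether to drop them depends on the later analysis, and this notebook keeps them.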

## Filter Main Characters
Keep only the rows spoken by each show's main characters, matching names case-insensitively.

In [5]:
friends_characters = ['rachel', 'ross', 'chandler', 'monica', 'joey', 'phoebe']
friends = friends[friends['character'].str.lower().isin(friends_characters)].copy()

tbbt_characters = ['sheldon', 'leonard', 'penny', 'howard', 'raj', 'amy', 'bernadette']
tbbt = tbbt[tbbt['person_scene'].str.lower().isin(tbbt_characters)].copy()

himym_characters = ['ted', 'marshall', 'lily', 'robin', 'barney']
himym = himym[himym['name'].str.lower().isin(himym_characters)].copy()
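
Since the same filter is applied three times with different column names, it could be factored into a small helper. The function `filter_main_characters` below is hypothetical and not part of the original notebook. It also trims whitespace around speaker names, which the cell above does not do. The sample rows are made up.

```python
import pandas as pd

def filter_main_characters(df, column, names):
    """Keep rows whose speaker (trimmed, lowercased) appears in `names`."""
    speaker = df[column].str.strip().str.lower()
    return df[speaker.isin(names)].copy()

# Hypothetical sample rows, not taken from the datasets.
sample = pd.DataFrame({
    "character": ["Ross", "GUNTHER", " monica ", "Janice"],
    "line": ["Hi.", "Coffee?", "I know!", "Oh. My. God."],
})

kept = filter_main_characters(sample, "character", ["ross", "monica"])
print(kept["character"].tolist())  # ['Ross', ' monica ']
```

The helper normalizes names only for matching, so the stored values in the column are left unchanged.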

## Export Cleaned Data
Save the cleaned datasets to new CSV files for further analysis.

In [6]:
friends.to_csv('FINAL_friends_clean.csv', index=False)
himym.to_csv('FINAL_himym_clean.csv', index=False)
tbbt.to_csv('FINAL_tbbt_clean.csv', index=False)
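
CSV files do not preserve column types, so the parsed dates are written out as text. A minimal sketch of a round trip, using an in-memory buffer and a made-up one-row frame instead of the real files, shows how `parse_dates` restores the datetime type on reload:

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for a cleaned dataset.
df = pd.DataFrame({
    "season": [1],
    "original_air_date": pd.to_datetime(["1994-09-22"]),
})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the named column back to datetimes on load.
reloaded = pd.read_csv(buffer, parse_dates=["original_air_date"])
print(reloaded["original_air_date"].dt.year.tolist())  # [1994]
```

The same `parse_dates` argument would apply when reading the `FINAL_*_clean.csv` files in a later notebook.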