# Gender Classification in Film Dialogues

This notebook explores whether the gender of a movie's lead actor can be predicted from dialogue statistics and metadata. We fit several machine learning classifiers and compare their performance on this task.


## 1. Setup and Imports

In this section, we import all the required libraries for:

- Data handling (`pandas`, `numpy`)
- Visualization (`matplotlib`, `seaborn`)
- Machine Learning (`scikit-learn`)

We also fix NumPy's global random seed so that results are reproducible from run to run.

In [None]:
from IPython.core.pylabtools import figsize
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn import preprocessing as skl_pre
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Optional: suppress SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# Fix the random seed for reproducibility
np.random.seed(1)
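
Seeding NumPy's global generator covers scikit-learn components that are left at `random_state=None`, but the outcome then depends on the order in which cells are run. A minimal sketch of a more explicit alternative is to pass the seed to each randomized step. The synthetic data from `make_classification` and the helper `fit_and_score` below are assumptions for illustration only and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 1  # same value the notebook passes to np.random.seed

# Synthetic stand-in data (assumption); the notebook itself uses the film features.
X, y = make_classification(n_samples=200, n_features=13, random_state=SEED)


def fit_and_score(seed):
    """Split, fit, and score with every random step tied to `seed`."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Two runs with the same seed should give the same accuracy.
score_a = fit_and_score(SEED)
score_b = fit_and_score(SEED)
print(score_a == score_b)
```

Passing `random_state` explicitly keeps each split and model reproducible even if cells are re-run or reordered.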


## 2. Load the Data

We load the training and test datasets from the `../data/` folder. They contain features extracted from movie scripts, such as word counts by gender, along with other metadata.

These features are what the models will use to predict the gender of the lead actor. Printing the shape of each frame confirms that both files loaded and shows how many rows and columns each one has.


In [6]:
# Load the Data
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

# Print shape to confirm successful load
print("Train shape:", train.shape)
print("Test shape:", test.shape)


Train shape: (1039, 14)
Test shape: (387, 13)
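
The printed shapes show that the test set has one column fewer than the training set. One plausible reading is that the target label is withheld from the test file, but that is an assumption until the column names are compared. A small sketch of that check follows; the miniature frames and their column names (`feature_a`, `feature_b`, `label`) are hypothetical placeholders, not the real schema, and `columns_only_in` is a helper introduced here for illustration.

```python
import pandas as pd

# Hypothetical miniature frames standing in for train.csv and test.csv.
train_demo = pd.DataFrame(
    {"feature_a": [1, 2], "feature_b": [3, 4], "label": ["x", "y"]}
)
test_demo = pd.DataFrame({"feature_a": [5], "feature_b": [6]})


def columns_only_in(left, right):
    """Return the columns of `left` that are absent from `right`, in order."""
    return [col for col in left.columns if col not in right.columns]


missing_in_test = columns_only_in(train_demo, test_demo)
print(missing_in_test)
```

In the notebook, the same call would be `columns_only_in(train, test)`, which should return exactly one column name given the shapes above.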
