# Model Information

Scikit-learn provides several classification models that can handle categorical data. Here are a few examples:

# Categorical Naive Bayes (CategoricalNB):
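This model is suited to classification with discrete features that each follow a categorical distribution. Unlike MultinomialNB, which models feature counts, it treats each feature as a single draw from a fixed set of categories, so every feature must be encoded as integers from 0 to n_categories - 1 (for example with OrdinalEncoder).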
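
As a minimal sketch of the encoding step described above, the example below fits CategoricalNB on a made-up toy dataset. The feature values, labels, and variable names are illustrative assumptions, not data from this notebook.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Toy data (made up for illustration): two categorical features per sample.
X_raw = np.array([
    ["red", "small"],
    ["red", "large"],
    ["blue", "small"],
    ["green", "large"],
    ["blue", "large"],
    ["green", "small"],
])
y_cat = np.array([0, 0, 1, 1, 1, 0])

# CategoricalNB expects every category encoded as an integer 0..n_categories-1.
encoder = OrdinalEncoder()
X_cat = encoder.fit_transform(X_raw).astype(int)

cat_nb = CategoricalNB()
cat_nb.fit(X_cat, y_cat)

# New samples must be encoded with the same fitted encoder.
new_sample = encoder.transform([["red", "small"]]).astype(int)
pred_cat = cat_nb.predict(new_sample)[0]
print(pred_cat)
```

Both "red" and "small" occur mostly with class 0 in the toy data, so that is the class the model favors for this sample.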

# Bernoulli Naive Bayes (BernoulliNB):
This model is suited to binary features (features with only two possible values, such as present/absent). It is related to MultinomialNB, but it models whether a feature occurs rather than how often, and it explicitly penalizes the absence of a feature.
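
A small sketch of BernoulliNB on presence/absence features follows. It assumes character-level binary features built with `CountVectorizer(analyzer="char", binary=True)`, which is a choice made here for illustration and not the configuration used later in this notebook. The names are taken from the sample output shown further down.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# A few (name, gender) pairs in the style of this notebook's data.
names = ["Aaban", "Aabha", "Aabid", "Aada", "Aadam", "Aadhira"]
genders = ["M", "F", "M", "F", "M", "F"]

# binary=True records only whether a character is present (1) or absent (0).
char_vectorizer = CountVectorizer(analyzer="char", binary=True)
X_bin = char_vectorizer.fit_transform(names)

bnb = BernoulliNB()
bnb.fit(X_bin, genders)

pred_bnb = bnb.predict(char_vectorizer.transform(["Aadya"]))[0]
print(sorted(char_vectorizer.vocabulary_))  # the characters used as features
print(pred_bnb)
```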

# Logistic Regression (LogisticRegression):
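This model is a linear classifier that passes a weighted sum of the features through the logistic function to estimate class probabilities. It is most often used for binary classification, where it predicts the probability of the positive class, and scikit-learn's implementation also supports multiclass problems. Categorical features must be encoded numerically first (for example with one-hot encoding).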
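
To make the role of the logistic function concrete, the sketch below checks, on a made-up one-feature dataset, that applying the logistic function to the linear score from `decision_function` reproduces the positive-class column of `predict_proba` for a binary problem with default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem (made up): one numeric feature, two classes.
X_lr = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_lr = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression()
log_reg.fit(X_lr, y_lr)

# Linear score w*x + b for each sample.
z = log_reg.decision_function(X_lr)

# Logistic (sigmoid) function applied by hand.
manual_proba_lr = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
proba_lr = log_reg.predict_proba(X_lr)[:, 1]

print(np.allclose(manual_proba_lr, proba_lr))
```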

# Decision Trees (DecisionTreeClassifier):
This model works by recursively partitioning the data into smaller subsets based on feature values. It can be used for problems with both categorical and numerical features, but scikit-learn's implementation requires numeric input, so categorical features must be encoded first.
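
The following sketch shows one way to feed a mix of categorical and numerical features to a decision tree: one-hot encode the categorical column and stack it next to the numeric column. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): one categorical column and one numeric column.
colors = np.array([["red"], ["blue"], ["green"], ["blue"], ["red"], ["green"]])
sizes = np.array([[1.0], [2.5], [3.0], [0.5], [2.0], [1.5]])
y_tree = np.array([0, 1, 1, 0, 1, 0])

# One-hot encode the categorical column, then append the numeric column.
onehot = OneHotEncoder()
colors_encoded = onehot.fit_transform(colors).toarray()
X_tree = np.hstack([colors_encoded, sizes])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tree, y_tree)

# An unconstrained tree fits distinct training samples exactly.
train_acc = tree.score(X_tree, y_tree)
print(X_tree.shape, train_acc)
```

The perfect training score here says nothing about generalization; it only reflects that an unrestricted tree keeps splitting until the training samples are separated.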

# Random Forests (RandomForestClassifier):
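This model is an ensemble of decision trees, each trained on a bootstrap sample of the data, whose predictions are averaged. Like a single tree, it can be used for problems with both categorical and numerical features once the categorical features are encoded numerically.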
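
As an illustration of the ensemble idea, the sketch below trains a small forest on synthetic data from `make_classification` (an assumption made for the example) and checks that the forest's predicted probabilities are the mean of the individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic numeric data, used only to have something to fit.
X_rf, y_rf = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_rf, y_rf)

# Each fitted tree is available in forest.estimators_.
tree_probas = np.stack([t.predict_proba(X_rf) for t in forest.estimators_])

# Averaging the per-tree probabilities reproduces the forest's output.
manual_proba_rf = tree_probas.mean(axis=0)
forest_proba = forest.predict_proba(X_rf)

print(len(forest.estimators_), np.allclose(manual_proba_rf, forest_proba))
```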

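# Support Vector Machines (SVC):
This model works by finding the hyperplane that separates the classes with the largest margin; with a non-linear kernel, that hyperplane lives in the kernel-induced feature space. It can be used for problems with both categorical and numerical features, provided the categorical features are encoded numerically.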
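
The sketch below uses a linear kernel so that the separating hyperplane is explicit (SVC's default kernel is RBF). On a made-up, linearly separable dataset, the sign of `decision_function` tells which side of the hyperplane a point falls on, and that matches the predicted class.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X_svc = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]], dtype=float)
y_svc = np.array([0, 0, 0, 1, 1, 1])

svc = SVC(kernel="linear")
svc.fit(X_svc, y_svc)

# Signed score relative to the hyperplane w.x + b = 0.
scores = svc.decision_function(X_svc)
preds_svc = svc.predict(X_svc)

print(svc.coef_, svc.intercept_)  # hyperplane parameters w and b
print(np.array_equal(preds_svc, (scores > 0).astype(int)))
```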

# K-Nearest Neighbors (KNeighborsClassifier):
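This model works by finding the k training instances most similar to a new instance and using their labels to predict the label of the new instance. It can be used for problems with both categorical and numerical features, but because similarity is computed as a distance, categorical features must be encoded numerically and features should be on comparable scales.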
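
A minimal sketch of the neighbor lookup follows, using made-up 2-D points. `kneighbors` returns the indices of the closest training samples, and the prediction is the majority label among them.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters (made up for illustration).
X_knn = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_knn = np.array(["a", "a", "a", "b", "b", "b"])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_knn, y_knn)

query = np.array([[0.5, 0.5]])

# Indices and distances of the 3 nearest training samples.
distances, indices = knn.kneighbors(query)
neighbor_labels = y_knn[indices[0]]

pred_knn = knn.predict(query)[0]
print(neighbor_labels, pred_knn)
```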

# MODEL CODE EXAMPLES

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import CategoricalNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

class MyModel:
    def __init__(self, model_type):
        # Turns each name into a bag-of-tokens feature vector.
        self.vectorizer = CountVectorizer()
        if model_type == "CategoricalNB":
            self.model = CategoricalNB()
        elif model_type == "BernoulliNB":
            self.model = BernoulliNB()
        elif model_type == "LogisticRegression":
            self.model = LogisticRegression()
        elif model_type == "DecisionTreeClassifier":
            self.model = DecisionTreeClassifier()
        elif model_type == "RandomForestClassifier":
            self.model = RandomForestClassifier()
        elif model_type == "SVC":
            self.model = SVC()
        elif model_type == "KNeighborsClassifier":
            self.model = KNeighborsClassifier()
        else:
            # Fail early instead of leaving self.model undefined.
            raise ValueError(f"Unknown model_type: {model_type!r}")

    def train(self, training_data):
        # Each row starts with (name, gender); any extra fields are ignored.
        names = [row[0] for row in training_data]
        genders = [row[1] for row in training_data]
        X = self.vectorizer.fit_transform(names)
        self.model.fit(X, genders)

    def predict(self, name):
        X = self.vectorizer.transform([name])
        return self.model.predict(X)[0]

    def learn(self, name, gender):
        # Incremental updates need partial_fit, which only some estimators
        # (e.g. BernoulliNB) provide.
        if not hasattr(self.model, "partial_fit"):
            raise NotImplementedError(
                f"{type(self.model).__name__} does not support partial_fit"
            )
        X = self.vectorizer.transform([name])
        # The classes list must match the labels seen in train().
        self.model.partial_fit(X, [gender], classes=["M", "F", "NEUTRAL"])

#model = MyModel("LogisticRegression") # Instantiation

In [8]:
import pickle

# The context manager closes the file even if loading fails.
with open('train_raw_data.pkl', 'rb') as file:
    data = pickle.load(file)


for item in data:
    print(item)

('Aaban', 'M', 1.0)
('Aabha', 'F', 1.0)
('Aabid', 'M', 1.0)
('Aabriella', 'F', 1.0)
('Aada', 'F', 1.0)
('Aadam', 'M', 1.0)
('Aadan', 'M', 1.0)
('Aadarsh', 'M', 1.0)
('Aaden', 'M', 0.998814604)
('Aadesh', 'M', 1.0)
('Aadhav', 'M', 1.0)
('Aadhavan', 'M', 1.0)
('Aadhi', 'M', 1.0)
('Aadhira', 'F', 1.0)
('Aadhvik', 'M', 1.0)
('Aadhya', 'F', 1.0)
('Aadhyan', 'M', 1.0)
('Aadi', 'M', 0.977961433)
('Aadian', 'M', 1.0)
('Aadil', 'M', 1.0)
('Aadin', 'M', 1.0)
('Aadish', 'M', 1.0)
('Aadison', 'F', 1.0)
('Aadit', 'M', 1.0)
('Aadith', 'M', 1.0)
('Aadithya', 'M', 1.0)
('Aaditri', 'F', 1.0)
('Aaditya', 'M', 1.0)
('Aadiv', 'M', 1.0)
('Aadon', 'M', 1.0)
('Aadrian', 'M', 1.0)
('Aadrika', 'F', 1.0)
('Aadrit', 'M', 1.0)
('Aadvik', 'M', 1.0)
('Aadvika', 'F', 1.0)
('Aadya', 'F', 1.0)
('Aadyn', 'M', 0.9629629629629628)
('Aafia', 'F', 1.0)
('Aafreen', 'F', 1.0)
('Aagam', 'M', 1.0)
('Aage', 'M', 1.0)
('Aagot', 'F', 1.0)
('Aahaan', 'M', 1.0)
('Aahan', 'M', 1.0)
('Aahana', 'F', 1.0)
('Aahil', 'M', 1.0)
('Aahir', 

The class distribution is clearly imbalanced: class `0` (female) has **60,307 instances**, while class `1` (male) has **34,724 instances**. This imbalance likely explains why the model predicts class `0` almost exclusively: according to the confusion matrix, it does not correctly identify any of the male instances (`1`).
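
As a quick check of those counts, the arithmetic below computes the share of the majority class, which is the accuracy a model would reach by always predicting class `0` on data with this distribution. Only the two counts come from the text above; the variable names are illustrative.

```python
# Class counts quoted above.
n_female = 60307  # class 0
n_male = 34724    # class 1

total = n_female + n_male

# Accuracy of a model that always predicts the majority class (0).
majority_share = n_female / total

# How many female instances there are per male instance.
imbalance_ratio = n_female / n_male

print(f"total instances: {total}")
print(f"majority-class share: {majority_share:.2%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

The majority share comes out at roughly 63.5%, so any accuracy near that figure is consistent with a model that has learned nothing about the minority class.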

### Action Plan:

1. **Imbalance Handling:**
   - The class imbalance is likely pushing the model toward the majority class (class `0`). The accuracy of 63.37% is roughly the share of class `0` in the data, which is what a model that always predicts the majority class would score, while the minority class (`1`) is never predicted correctly.

   To mitigate this, you should:
   
   - **Use `class_weight="balanced"` in the `DecisionTreeClassifier`** to account for the imbalance:
     ```python
     self.model = DecisionTreeClassifier(class_weight="balanced")
     ```
     This weights each class inversely to its frequency, so mistakes on the minority class cost more during training and the model is pushed to learn both classes.

   - **Try resampling techniques**:
     - **Oversampling** the minority class (`1`) or **undersampling** the majority class (`0`).
     - Libraries like `imbalanced-learn` can help with oversampling (`SMOTE`) or undersampling:
       ```bash
       pip install imbalanced-learn
       ```
       Example of using SMOTE:
       ```python
       from imblearn.over_sampling import SMOTE

       smote = SMOTE(random_state=42)
       X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
       ```

2. **Check the Model's Ability to Generalize**:
   - **Training accuracy of 100%**, combined with the much lower test accuracy, indicates that your model is overfitting the training data. Unconstrained decision trees tend to overfit because they keep splitting until the training samples are fit exactly.
     - Try limiting the depth of the tree:
       ```python
       self.model = DecisionTreeClassifier(max_depth=10, class_weight="balanced")
       ```
     - A shallower tree cannot memorize individual samples, so it is forced to make more general splits, which reduces overfitting.

3. **Alternative Models**:
   - **Logistic Regression**: Since decision trees are sensitive to data imbalances and overfitting, switching to a simpler, regularized model such as logistic regression might help improve generalization and handle imbalance more effectively. You can uncomment and use logistic regression:
     ```python
     from sklearn.linear_model import LogisticRegression
     self.model = LogisticRegression(class_weight="balanced")
     ```

4. **Evaluation Metrics Beyond Accuracy**:
   - With imbalanced data, **accuracy** is not the best measure. Instead, use metrics such as **precision**, **recall**, and **F1-score**, especially for the minority class. You can compute them with:
     ```python
     from sklearn.metrics import classification_report

     print(classification_report(y_test, y_pred))
     ```
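
To show what `class_weight="balanced"` from step 1 amounts to, the sketch below computes the balanced weights for the class counts quoted earlier, once with `compute_class_weight` and once by hand using `n_samples / (n_classes * count_per_class)`. The label array is rebuilt from the counts alone and is an assumption made for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the quoted counts: 60,307 zeros and 34,724 ones.
y_all = np.repeat([0, 1], [60307, 34724])
class_labels = np.array([0, 1])

# Weights scikit-learn derives for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=class_labels, y=y_all)

# The same formula written out by hand.
manual_weights = len(y_all) / (len(class_labels) * np.bincount(y_all))

print(dict(zip(class_labels.tolist(), weights.round(3).tolist())))
```

The minority class (`1`) receives the larger weight, roughly 1.37 against 0.79 for class `0`.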

### Updated Training Function Example:
```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train(self, train_raw_data):
    """Train the model with provided raw data."""
    self.train_raw_data = train_raw_data
    # Each row starts with (name, gender); any extra fields are ignored.
    names = [row[0] for row in train_raw_data]
    genders = [row[1] for row in train_raw_data]
    gender_map = {"M": 1, "F": 0}  # Map gender labels to numerical values
    y = np.array([gender_map[gender] for gender in genders])
    X = self.vectorizer.fit_transform(names)

    # Hold out 30% of the data for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Use class_weight="balanced" to address imbalance
    self.model = DecisionTreeClassifier(class_weight="balanced", max_depth=10)

    self.model.fit(X_train, y_train)
    y_pred = self.model.predict(X_test)

    # Print accuracy and other metrics
    print("Accuracy: ", accuracy_score(y_test, y_pred))
    print("Training Accuracy: {:.2f}% ".format(self.model.score(X_train, y_train) * 100))
    print("Testing Accuracy: {:.2f}% ".format(self.model.score(X_test, y_test) * 100))

    # Confusion matrix and classification report
    print(f"Confusion matrix: \n{confusion_matrix(y_test, y_pred)}")
    print(classification_report(y_test, y_pred))

    self.is_trained = True
```
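
To help read the metrics this function prints, here is a small sketch with made-up labels for a degenerate model that only ever predicts class `0`. It illustrates why accuracy alone can look acceptable while the minority class is missed entirely; none of the numbers are results from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    recall_score,
)

# Made-up test labels: 6 samples of class 0 and 4 of class 1.
y_true_demo = [0] * 6 + [1] * 4
# A degenerate model that always predicts the majority class.
y_pred_demo = [0] * 10

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
recall_minority = recall_score(y_true_demo, y_pred_demo, pos_label=1)
bal_acc_demo = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(cm_demo)
print(acc_demo, recall_minority, bal_acc_demo)
```

The second column of the confusion matrix is all zeros, recall for class `1` is 0, and balanced accuracy (the mean of the per-class recalls) drops to 0.5 even though plain accuracy is 0.6.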


In [9]:
# # Dumping csv data to pkl

# import pandas as pd
# import pickle

# # Load data from name_gender.csv
# data = pd.read_csv('name_gender.csv')

# # Save data to gender_data.pkl
# with open('gender_data.pkl', 'wb') as f:
#     pickle.dump(data, f)