# Assignment 3 - Text Analysis
An explanation of this assignment can be found in the accompanying .pdf document.


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, wordnet)</b> (moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Personal Details:

In [2]:
# Details Student 1:
# Name : Tal Tubul
# I.D : 208835355
# Email : taltub123@gmail.com

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [3]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.tree import DecisionTreeClassifier
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# Show the output of every expression in a cell, not only the last one, so several results can be displayed from a single cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [4]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: WordNet is optional.

#### (optional) Only if you haven't installed WordNet (for Hebrew), use:

In [5]:
# word net installation:

# uncomment if you want to use WordNet and still need to install it
# !pip install wn
# !python -m wn download omw-he:1.4

In [6]:
# word net import:

# optional: comment out this import if you do not use WordNet
import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: hebrew_tokenizer is optional.

#### (optional) Only if you haven't installed hebrew_tokenizer, use:

In [7]:
# Hebrew tokenizer installation:

# uncomment if you want to use hebrew_tokenizer and still need to install it:
# !pip install hebrew_tokenizer

In [8]:
# Hebrew tokenizer import:

# optional: comment out this import if you do not use hebrew_tokenizer
import hebrew_tokenizer as ht



### Reading input files
Reads the annotated training corpus (raw text data with labels) and the unlabeled test corpus.

In [9]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [10]:
df_train.head(8)
df_train.shape

Unnamed: 0,story,gender
0,"כשחבר הזמין אותי לחול, לא באמת חשבתי שזה יקרה,...",m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,"כשהייתי ילד, מטוסים היה הדבר שהכי ריתק אותי. ב...",m
4,‏הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f
5,לפני כ3 חודשים טסתי לרומא למשך שבוע. טסתי במטו...,f
6,אני כבר שנתיים נשוי והשנה אני ואישתי סוף סוף י...,m
7,השנה התחלנו שיפוץ בדירה שלנו בתל אביב. הדירה ה...,f


(753, 2)

In [11]:
df_test.head(3)
df_test.shape

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...


(323, 2)

### Your implementation:
Write your code solution in the following code-cells

##### Function: translate_df_to_english(df)
Transliterates the 'story' column of a given DataFrame from Hebrew letters to Latin letters using a fixed character mapping, and returns the result as a new DataFrame. The words themselves are not translated; only the alphabet changes, so the later analysis works on plain ASCII tokens.

**Input:** 
df (DataFrame): DataFrame with a 'story' column containing Hebrew text.

**Output:** 
Returns a new DataFrame whose 'story' column is transliterated to Latin characters.

In [12]:
def translate_df_to_english(df):
    
    english_df = df.copy()
    translation_table = str.maketrans(
        "אבגדהוזחטיכלמנסעפצקרשתםףץןך",
        "abgdhwzxviklmnsypcqretMFCNX"
    )
    # Vectorized transliteration; this works for any index, not only 0..n-1
    english_df['story'] = english_df['story'].str.translate(translation_table)

    return english_df
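
To make the mapping concrete, here is a minimal sketch of the same `str.maketrans` / `str.translate` mechanism on a single word. It uses only a four-letter subset of the table above, written with Unicode escapes so the example does not depend on right-to-left rendering; the variable names are illustrative and not part of the notebook.

```python
# Four-letter subset of the notebook's table (illustrative only).
# \u05e9 = shin, \u05dc = lamed, \u05d5 = vav, \u05dd = final mem
table = str.maketrans("\u05e9\u05dc\u05d5\u05dd", "elwM")

word = "\u05e9\u05dc\u05d5\u05dd"        # the Hebrew word "shalom"
transliterated = word.translate(table)
print(transliterated)                     # elwM

# Characters that are not in the table (spaces, digits, punctuation)
# pass through unchanged.
mixed = "\u05e9\u05dc\u05d5\u05dd 123!".translate(table)
print(mixed)                              # elwM 123!
```

Because final letters map to uppercase characters, the transliteration keeps them distinct until a vectorizer lowercases the text.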


##### Function: create_count_vectorizer(ngram_range_index)
Creates a CountVectorizer object with a specified ngram range.

**Input:** 
ngram_range_index: An integer indicating the desired ngram range.

**Output:** 
A CountVectorizer object configured with the specified ngram range.

In [13]:
def create_count_vectorizer(ngram_range_index):
    return CountVectorizer(ngram_range=(ngram_range_index, ngram_range_index)) 
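
Since the helper passes the same number as both bounds of `ngram_range`, each vectorizer extracts n-grams of exactly one size. The sketch below illustrates this on two made-up English sentences (the toy documents are an assumption, not data from the assignment).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran"]

# (1, 1) keeps single words only; (2, 2) keeps pairs of adjacent words only.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

unigram_vocab = sorted(unigrams.vocabulary_)
bigram_vocab = sorted(bigrams.vocabulary_)
print(unigram_vocab)  # ['cat', 'ran', 'sat', 'the']
print(bigram_vocab)   # ['cat ran', 'cat sat', 'the cat']
```

Note that the default token pattern ignores single-character tokens, so one-letter words never reach the vocabulary.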

##### Function: create_Tfidf_Vectorizer(ngram_range_index)
Creates a TfidfVectorizer object with a specified ngram range.

**Input:** 
ngram_range_index: An integer indicating the desired ngram range.

**Output:** 
A TfidfVectorizer object configured with the specified ngram range.

In [14]:
def create_Tfidf_Vectorizer(ngram_range_index):
    return TfidfVectorizer(ngram_range=(ngram_range_index, ngram_range_index)) 
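
As a rough illustration of how the two vectorizers differ, the sketch below runs both on three made-up documents. With its default settings, `TfidfVectorizer` rescales every row to unit L2 length, while `CountVectorizer` returns raw integer counts. The toy documents are an assumption for demonstration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["red apple", "green apple pie", "red red wine"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# Raw counts: the word "red" appears twice in the third document.
max_count = counts.toarray().max()

# TF-IDF rows are L2-normalized by default, so each row has length 1.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(max_count)   # 2
print(row_norms)   # approximately [1. 1. 1.]
```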

##### Function: normailze(X_train, X_test)
Takes the training and test data matrices as input, normalizes them using the StandardScaler, and returns the normalized versions of the matrices.

**Input:** 
1. X_train: Training data matrix (sparse; it is converted to a dense array with `.toarray()`).
2. X_test: Test data matrix (sparse; it is converted to a dense array with `.toarray()`).

**Output:** 
1. X_train_normalized: Normalized training data matrix.
2. X_test_normalized: Normalized test data matrix.

In [15]:
def normailze(X_train, X_test):
    
    scaler = StandardScaler()

    # Fit the scaler on the training data only, then reuse the same statistics for the test data
    X_train_normalized = scaler.fit_transform(X_train.toarray())
    X_test_normalized = scaler.transform(X_test.toarray())

    return X_train_normalized, X_test_normalized
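
The sketch below shows, on a tiny hand-made matrix, what fitting the scaler on the training data and reusing it on the test data means in practice. The numbers are made up for illustration. For large n-gram vocabularies, converting to a dense array can use a lot of memory; one possible alternative is `StandardScaler(with_mean=False)`, which accepts sparse input but skips centering, so it would not reproduce the results of this notebook exactly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 0.0],
                    [3.0, 0.0]])
X_test = np.array([[5.0, 2.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

# Column 0: mean 2, std 1. Column 1 is constant, so its scale is treated as 1.
print(X_train_scaled)  # [[-1.  0.] [ 1.  0.]]
print(X_test_scaled)   # [[3. 2.]]
```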

##### Function: fit(clf, X_train_normalized, y_train)
 Fits a given classifier to the normalized training data and their corresponding target labels. It's a convenient wrapper to call the fit method of the classifier.

**Input:** 
1. clf: The classifier model.
2. X_train_normalized: Normalized training data.
3. y_train: Target label.

**Output:** 
None (the classifier is fitted in place).

In [16]:
def fit(clf, X_train_normalized, y_train):

    clf.fit(X_train_normalized, y_train)


##### Function: predict(clf, X_test_normalized)
Predicts labels for the normalized test data using a trained classifier. It is a thin wrapper around the classifier's predict method.

**Input:** 
1. clf: The trained classifier model.
2. X_test_normalized: Normalized test data.

**Output:** 
The function returns an array of predicted labels for the given test data.

In [17]:
def predict(clf, X_test_normalized):

    return clf.predict(X_test_normalized)


##### Function: evaluate_accuracy(y_test, y_predicted)
Calculates the accuracy of a classification model by comparing the predicted labels with the actual labels of the test data.

**Input:** 
1. y_test: The true labels of the test data.
2. y_predicted: The predicted labels of the test data.

**Output:** 
The function returns the accuracy of the model, which is the proportion of correctly predicted labels among all the test samples.

In [18]:
def evaluate_accuracy(y_test, y_predicted):

    return  np.mean(y_predicted == y_test)
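
A small worked example of the same computation, with made-up labels: comparing the two arrays element by element gives booleans, and their mean is the fraction of correct predictions.

```python
import numpy as np

y_true = np.array(["m", "f", "m", "m"])
y_pred = np.array(["m", "m", "m", "f"])

matches = y_pred == y_true    # [ True False  True False]
accuracy = np.mean(matches)   # 2 correct out of 4
print(accuracy)               # 0.5
```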


##### Function: f1_score_calc(testDf, y_pred)
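Calculates the average F1 score: the F1 score is computed once with 'm' as the positive class and once with 'f' as the positive class, and the two values are averaged (a macro average over the two classes).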

**Input:** 
1. testDf: A DataFrame containing the test examples and their true gender labels.
2. y_pred: The predicted gender labels of the test examples.

**Output:** 
The function returns the average F1 score.

In [19]:
def f1_score_calc(testDf, y_pred):
    f1_male = f1_score(testDf.gender, y_pred, pos_label='m')
    f1_female = f1_score(testDf.gender, y_pred, pos_label='f')
    average_f1 = (f1_male + f1_female) / 2

    return average_f1
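
To show what the averaged score measures, here is a pure-Python sketch that computes the per-class F1 by hand on five made-up labels. The helper `f1_for_label` is hypothetical and only mirrors the standard definition; for two classes the result should match `f1_score(..., average='macro')` from scikit-learn.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["m", "m", "m", "f", "f"]
y_pred = ["m", "m", "f", "f", "m"]

f1_m = f1_for_label(y_true, y_pred, "m")   # precision 2/3, recall 2/3 -> 2/3
f1_f = f1_for_label(y_true, y_pred, "f")   # precision 1/2, recall 1/2 -> 1/2
macro_f1 = (f1_m + f1_f) / 2               # 7/12, about 0.583
print(f1_m, f1_f, macro_f1)
```

Averaging the two classes equally means that poor performance on the smaller class lowers the score even when overall accuracy is high.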

##### Function: create_Perceptron_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf)
 Creates and evaluates a Perceptron model.

**Input:** 
1. X_train_normalized: Normalized training data.
2. X_test_normalized: Normalized test data (features).
3. y_train: True gender labels for the training data.
4. y_test: True gender labels for the test data.
5. testDf: DataFrame containing the test examples and their true gender labels.

**Output:** 
A dictionary describing the best setting found:
1. Average F1 score.
2. Accuracy of the Perceptron model.
3. The alpha value that produced the best average F1 score.

In [39]:
def create_Perceptron_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf):
    max_alpha = {
        "average_f1_perceptron_model" : 0,
        "perceptron_model_accuracy" : 0,
        "alpha_index" : ""
    }
    for alpha_index in [0.0001, 0.005, 0.01]:
        perceptron_model = Perceptron(alpha=alpha_index, max_iter=20)
        fit(perceptron_model, X_train_normalized, y_train)
        y_pred_perceptron_model = predict(perceptron_model, X_test_normalized)
        perceptron_model_accuracy = evaluate_accuracy(y_test, y_pred_perceptron_model)
        average_f1_perceptron_model = f1_score_calc(testDf, y_pred_perceptron_model)
        if average_f1_perceptron_model > max_alpha["average_f1_perceptron_model"]:
            max_alpha["average_f1_perceptron_model"] = average_f1_perceptron_model
            max_alpha["perceptron_model_accuracy"] = perceptron_model_accuracy
            max_alpha["alpha_index"] = alpha_index
    return max_alpha

##### Function: create_LinearSVC_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf)
 Creates and evaluates a Linear Support Vector Classification (LinearSVC) model.

**Input:** 
1. X_train_normalized: Normalized training data.
2. X_test_normalized: Normalized test data (features).
3. y_train: True gender labels for the training data.
4. y_test: True gender labels for the test data.
5. testDf: DataFrame containing the test examples and their true gender labels.

**Output:** 
A dictionary describing the best setting found:
1. Average F1 score.
2. Accuracy of the LinearSVC model.
3. The C value that produced the best average F1 score.

In [40]:
def create_LinearSVC_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf):
    max_C = {
        "average_f1_LinearSVC_model" : 0,
        "LinearSVC_model_accuracy" : 0,
        "C_index" : ""
    }
    for C_index in [0.0001, 0.005, 0.01]:
        # Use the candidate value from the loop (a fixed C would make every iteration identical)
        LinearSVC_model = LinearSVC(C=C_index, max_iter=20)
        fit(LinearSVC_model, X_train_normalized, y_train)
        y_pred_LinearSVC_model = predict(LinearSVC_model, X_test_normalized)
        LinearSVC_model_accuracy = evaluate_accuracy(y_test, y_pred_LinearSVC_model)
        average_f1_LinearSVC_model = f1_score_calc(testDf, y_pred_LinearSVC_model)
        if average_f1_LinearSVC_model > max_C["average_f1_LinearSVC_model"]:
            max_C["average_f1_LinearSVC_model"] = average_f1_LinearSVC_model
            max_C["LinearSVC_model_accuracy"] = LinearSVC_model_accuracy
            max_C["C_index"] = C_index
    return max_C

##### Function: create_SGDClassifier_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf)
 Creates and evaluates a Stochastic Gradient Descent (SGD) Classifier model.

**Input:** 
1. X_train_normalized: Normalized training data.
2. X_test_normalized: Normalized test data (features).
3. y_train: True gender labels for the training data.
4. y_test: True gender labels for the test data.
5. testDf: DataFrame containing the test examples and their true gender labels.

**Output:** 
A dictionary describing the best setting found:
1. Average F1 score.
2. Accuracy of the SGDClassifier model.
3. The loss function that produced the best average F1 score.

In [47]:
def create_SGDClassifier_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf):
    max_loss = {
        "average_f1_SGDClassifier_model" : 0,
        "SGDClassifier_model_accuracy" : 0,
        "loss_index" : ""
    }
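    # 'log' is omitted: it is a deprecated alias of 'log_loss' in scikit-learn and would duplicate that candidate
    for loss_index in ['hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron', 'squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive']: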
        SGDClassifier_model = SGDClassifier(loss=loss_index, penalty='l1', alpha=0.001, random_state=42, max_iter=10, tol= None)
        fit(SGDClassifier_model, X_train_normalized, y_train)
        y_pred_SGDClassifier_model = predict(SGDClassifier_model, X_test_normalized)
        SGDClassifier_model_accuracy = evaluate_accuracy(y_test, y_pred_SGDClassifier_model)
        average_f1_SGDClassifier_model = f1_score_calc(testDf, y_pred_SGDClassifier_model)
        if average_f1_SGDClassifier_model > max_loss["average_f1_SGDClassifier_model"]:
            max_loss["average_f1_SGDClassifier_model"] = average_f1_SGDClassifier_model
            max_loss["SGDClassifier_model_accuracy"] = SGDClassifier_model_accuracy
            max_loss["loss_index"] = loss_index
    return max_loss
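
The three functions above share one pattern: loop over candidate values, fit, score on the held-out part, keep the best. As a possible alternative, the sketch below expresses the same search with `GridSearchCV`, which scores each candidate with cross-validation instead of a single held-out split. It runs on synthetic data from `make_classification`, and the shortened list of losses is an assumption chosen to keep the example small; it is not the search used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized stories (illustrative only).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"loss": ["hinge", "modified_huber", "perceptron"]}
base_model = SGDClassifier(penalty="l1", alpha=0.001, random_state=42,
                           max_iter=10, tol=None)

# 'f1_macro' averages the per-class F1 scores, like f1_score_calc does.
search = GridSearchCV(base_model, param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)

best_loss = search.best_params_["loss"]
best_score = search.best_score_
print(best_loss, best_score)
```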

##### Function: split_to_trainDF_testDF()
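Splits the provided DataFrame df_train into training and test DataFrames: every fifth row (row id divisible by 5) goes to the test part, and the rest go to the training part.

**Note:** I split df_train so that I could run experiments and compare different models before running the chosen model on df_test.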

**Input:** 
None 

**Output:** 
1. trainDf: DataFrame containing training examples.
2. testDf: DataFrame containing test examples.

In [23]:
def split_to_trainDF_testDF():
    
    english_df_train = translate_df_to_english(df_train)

    indexed_df_train = english_df_train.copy()
    indexed_df_train["id"] = indexed_df_train.index
    
    trainDf=indexed_df_train[indexed_df_train["id"]%5!=0]
    testDf=indexed_df_train[indexed_df_train["id"]%5==0]

    return trainDf, testDf
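
The effect of the modulo rule can be seen on a ten-row toy frame (made-up data): rows whose id is divisible by 5 form the held-out part, which gives roughly an 80/20 split. If class balance between the two parts matters, `train_test_split` with its `stratify` argument is another option.

```python
import pandas as pd

toy = pd.DataFrame({
    "story": [f"story {i}" for i in range(10)],
    "gender": ["m", "f"] * 5,
})
toy["id"] = toy.index

train_part = toy[toy["id"] % 5 != 0]   # ids 1-4 and 6-9
test_part = toy[toy["id"] % 5 == 0]    # ids 0 and 5

print(len(train_part), len(test_part))  # 8 2
print(list(test_part["id"]))            # [0, 5]
```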

##### Function: max_f1_score_for_vectorizer(vectorizer, vectorizer_type ,trainDf, testDf, y_train, y_test, max_average_f1_dict, ngram_range_index)
Calculates the maximum F1 score achieved among different models (Perceptron, LinearSVC, SGDClassifier) for a given vectorizer type and n-gram range index.

The function performs the following steps:

* Transforms text data using the provided vectorizer.
* Normalizes the transformed data.
* Calls the functions to create models for Perceptron, LinearSVC, and SGDClassifier, calculates their F1 scores and accuracy.
* Compares the F1 scores of these models and updates the max_average_f1_dict if a higher F1 score is found.
* Returns the updated max_average_f1_dict.

**Input:** 
1. vectorizer: Vectorizer object for feature extraction.
2. vectorizer_type: Type of the vectorizer used (for updating the dictionary).
3. trainDf: DataFrame containing training examples.
4. testDf: DataFrame containing test examples.
5. y_train: Labels for training examples.
6. y_test: Labels for test examples.
7. max_average_f1_dict: Dictionary to store maximum F1 score information.
8. ngram_range_index: Index for n-gram range.

**Output:** 
1. max_average_f1_dict: Updated dictionary with maximum F1 score information.

In [48]:
def max_f1_score_for_vectorizer(vectorizer, vectorizer_type ,trainDf, testDf, y_train, y_test, max_average_f1_dict, ngram_range_index):

    X_train = vectorizer.fit_transform(trainDf['story'])
    X_test = vectorizer.transform(testDf['story'])

    X_train_normalized, X_test_normalized = normailze(X_train, X_test)

    perceptron_model_dict = create_Perceptron_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf)
    LinearSVC_model_dict = create_LinearSVC_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf)
    SGDClassifier_model_dict = create_SGDClassifier_model(X_train_normalized, X_test_normalized, y_train, y_test, testDf)  
    
    max_f1_score =  max(perceptron_model_dict["average_f1_perceptron_model"], LinearSVC_model_dict["average_f1_LinearSVC_model"], SGDClassifier_model_dict["average_f1_SGDClassifier_model"])

    if max_f1_score > max_average_f1_dict['average_f1_score']:
        max_average_f1_dict['vectorizer_type'] = vectorizer_type
        max_average_f1_dict['ngram_range_index'] = ngram_range_index
        
        if max_f1_score == perceptron_model_dict["average_f1_perceptron_model"]:
            max_average_f1_dict['model_type'] = 'Perceptron'
            max_average_f1_dict['average_f1_score'] = perceptron_model_dict["average_f1_perceptron_model"]
            max_average_f1_dict['evaluate_accuracy'] = perceptron_model_dict["perceptron_model_accuracy"]
            max_average_f1_dict['hyper_parameter'] = perceptron_model_dict["alpha_index"]

        elif max_f1_score == LinearSVC_model_dict["average_f1_LinearSVC_model"]:
            max_average_f1_dict['model_type'] = 'LinearSVC'
            max_average_f1_dict['average_f1_score'] = LinearSVC_model_dict["average_f1_LinearSVC_model"]
            max_average_f1_dict['evaluate_accuracy'] = LinearSVC_model_dict["LinearSVC_model_accuracy"]
            max_average_f1_dict['hyper_parameter'] = LinearSVC_model_dict["C_index"]

        else:
            max_average_f1_dict['model_type'] = 'SGDClassifier'
            max_average_f1_dict['average_f1_score'] = SGDClassifier_model_dict["average_f1_SGDClassifier_model"]
            max_average_f1_dict['evaluate_accuracy'] = SGDClassifier_model_dict["SGDClassifier_model_accuracy"]
            max_average_f1_dict['hyper_parameter'] = SGDClassifier_model_dict["loss_index"]
    
    return max_average_f1_dict
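
The vectorize, normalize, and classify steps above could also be chained in a scikit-learn `Pipeline`, which keeps the "fit on train, transform on test" discipline automatic. The sketch below is one possible arrangement under stated assumptions: the texts are made-up transliterated tokens, and `StandardScaler(with_mean=False)` is used so the sparse matrix does not need to be densified, which differs from the centering done by `normailze` in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data (illustrative only).
train_texts = ["ani halakti habaita", "ani halakt habaita",
               "hu raa seret", "hi raata seret"]
train_labels = ["m", "f", "m", "f"]

text_clf = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 1))),
    ("scaler", StandardScaler(with_mean=False)),   # sparse-friendly scaling
    ("classifier", SGDClassifier(loss="hinge", random_state=42,
                                 max_iter=10, tol=None)),
])

text_clf.fit(train_texts, train_labels)
predictions = text_clf.predict(train_texts)
print(predictions)
```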
    

##### Function: find_best_model()
This function finds the best performing model among different combinations of n-gram ranges and vectorizer types. The function follows these steps:
* Splits the data into training and testing sets using split_to_trainDF_testDF() function.
* Defines labels for training and testing sets.
* Initializes a dictionary max_average_f1_dict to store information about the best model.
* Iterates through different n-gram ranges (1 to 3) and for each range:
    1. Creates a CountVectorizer and a TfidfVectorizer.
    2. Calls the max_f1_score_for_vectorizer() function to determine the best model type and vectorizer type for the current n-gram range.
    3. Updates the max_average_f1_dict with the best performing model's information.
* Returns the max_average_f1_dict containing information about the best performing model.

**Input:** 
None 

**Output:** 
1. max_average_f1_dict: A dictionary containing information about the best performing model, including n-gram range, vectorizer type, model type, average F1 score, and accuracy.

In [43]:
def find_best_model():

    trainDf, testDf = split_to_trainDF_testDF()

    y_train = trainDf['gender']
    y_test = testDf['gender']


    max_average_f1_dict = {
        'ngram_range_index' : 0,
        'vectorizer_type' : '',
        'model_type' : '',
        'average_f1_score' : 0,
        'evaluate_accuracy' : 0,
        'hyper_parameter' : 0
    }

    for ngram_range_index in range(1,4):
        count_vectorizer = create_count_vectorizer(ngram_range_index)
        max_average_f1_dict = max_f1_score_for_vectorizer(count_vectorizer, 'count_vectorizer' ,trainDf, testDf, y_train, y_test, max_average_f1_dict, ngram_range_index)

        Tfidf_Vectorizer = create_Tfidf_Vectorizer(ngram_range_index)
        max_average_f1_dict = max_f1_score_for_vectorizer(Tfidf_Vectorizer, 'Tfidf_Vectorizer' ,trainDf, testDf, y_train, y_test, max_average_f1_dict, ngram_range_index)

    return max_average_f1_dict
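
To get a feel for the size of the search, the sketch below enumerates the vectorizer configurations that the loop above visits. Each of them is then evaluated with every Perceptron, LinearSVC, and SGDClassifier setting on a dense matrix, which is a plausible reason for the long running time. The list names are illustrative.

```python
from itertools import product

ngram_sizes = range(1, 4)                                  # 1, 2, 3
vectorizer_types = ["count_vectorizer", "Tfidf_Vectorizer"]

# Every (n-gram size, vectorizer type) pair that the loop evaluates.
combos = list(product(ngram_sizes, vectorizer_types))

print(len(combos))   # 6
print(combos[0])     # (1, 'count_vectorizer')
print(combos[-1])    # (3, 'Tfidf_Vectorizer')
```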

In [49]:
# Call find_best_model() to find the best model to use on df_test.
# Be careful: this call takes about 18 minutes. To save time, use the precomputed dictionary below instead.
best_model_dict = find_best_model()
""" 
best_model_dict = {
        'ngram_range_index' : 2,
        'vectorizer_type' : 'count_vectorizer',
        'model_type' : 'SGDClassifier',
        'average_f1_score' : 0.7276233128071158,
        'evaluate_accuracy' : 0.8079470198675497,
        'hyper_parameter' : "hinge"
    }
"""
# Translate both training and testing data to English
english_df_train = translate_df_to_english(df_train)
english_df_test = translate_df_to_english(df_test)

# Prepare the target labels for training
y_train = english_df_train['gender']

# Choose the appropriate vectorizer based on the best model's vectorizer type
if best_model_dict['vectorizer_type'] == 'count_vectorizer':
    vectorizer = create_count_vectorizer(best_model_dict['ngram_range_index'])
else:
    vectorizer = create_Tfidf_Vectorizer(best_model_dict['ngram_range_index'])

# Transform text data into numerical features
X_train = vectorizer.fit_transform(english_df_train['story'])
X_test = vectorizer.transform(english_df_test['story'])

# Normalize the feature data
X_train_normalized, X_test_normalized = normailze(X_train, X_test)

# Choose the appropriate model based on the best model's type
if best_model_dict['model_type'] == 'Perceptron':
    model = Perceptron(alpha=best_model_dict['hyper_parameter'], max_iter=20)
elif best_model_dict['model_type'] == 'LinearSVC':
    model = LinearSVC(C=best_model_dict['hyper_parameter'], max_iter= 20)
else:
    model = SGDClassifier(loss=best_model_dict['hyper_parameter'], penalty='l1', alpha=0.001, random_state=42, max_iter=10, tol= None)

# Train the chosen model
fit(model, X_train_normalized, y_train)

# Predict gender using the trained model
y_pred_model = predict(model, X_test_normalized)

# Keep only the 'test_example_id' and 'predicted_category' columns in the final output DataFrame
df_test['predicted_category'] = y_pred_model
columns = ['test_example_id', 'predicted_category']
df_predicted = df_test[columns]


In [50]:

# Display the best model information
print("The best model that I found is:")
print(f"Vectorizer type: {best_model_dict['vectorizer_type']}")
print(f"Ngram range index: {best_model_dict['ngram_range_index']}")
print(f"Model type: {best_model_dict['model_type']}")
print(f"Hyper parameter value: {best_model_dict['hyper_parameter']}")
print(f"Average f1 score: {best_model_dict['average_f1_score']}")
print(f"Evaluate accuracy: {best_model_dict['evaluate_accuracy']}\n")

# Calculate and display the predicted model results
y_pred_counts = np.unique(y_pred_model, return_counts=True)
print(f"The y predicted model results are:")
print(f"Gender 'f' count: {y_pred_counts[1][0]}")
print(f"Gender 'm' count: {y_pred_counts[1][1]}")
print("The y_pred_model results: ")
y_pred_model

The best model that I found is:
Vectorizer type: count_vectorizer
Ngram range index: 2
Model type: SGDClassifier
Hyper parameter value: hinge
Average f1 score: 0.7276233128071158
Evaluate accuracy: 0.8079470198675497

The y predicted model results are:
Gender 'f' count: 43
Gender 'm' count: 280
The y_pred_model results: 


array(['m', 'm', 'm', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm',
       'm', 'f', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'f',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm',
       'm', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f',
       'm', 'm', 'f', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm',
       'm', 'm', 'm', 'f', 'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'f', 'm', 'f', 'm', 'm',
       'm', 'm', 'm', 'm', 'f', 'f', 'm', 'm', 'f', 'm', 'm', 'm', 'f',
       'f', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'm', 'f', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm
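
The count lines in the cell above read the result of `np.unique` by position, which assumes that both 'f' and 'm' occur in the predictions. A minimal sketch of a lookup by label, which also handles a missing class, might look like this (the example predictions are made up):

```python
from collections import Counter

# Hypothetical predictions in which the class 'f' never appears.
y_pred_example = ["m", "m", "m"]

counts = Counter(y_pred_example)
f_count = counts.get("f", 0)   # falls back to 0 when the label is absent
m_count = counts.get("m", 0)

print(f"Gender 'f' count: {f_count}")   # Gender 'f' count: 0
print(f"Gender 'm' count: {m_count}")   # Gender 'm' count: 3
```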

#### For comparison, I built an SGDClassifier with default hyperparameters and without normalization, to show that **the accuracy of the model decreases.**

In [59]:
#Splits the df_train to train and test
trainDf, testDf = split_to_trainDF_testDF()

y_train = trainDf['gender']
y_test = testDf['gender']
# Create the vectorizer
vec = create_count_vectorizer(2)

# Transform text data into numerical features (using the vectorizer created above)
x_train = vec.fit_transform(trainDf['story'])
x_test = vec.transform(testDf['story'])

#Create SGDClassifier model without the hyper parameter
comparison_model = SGDClassifier()

# Train the chosen model
fit(comparison_model, x_train, y_train)

# Predict gender using the trained model
y_pred_comparison_model = predict(comparison_model, x_test)

#Calc f1_score and accuracy of the model
comparison_model_accuracy = evaluate_accuracy(y_test, y_pred_comparison_model)
comparison_model_f1_score = f1_score_calc(testDf, y_pred_comparison_model)

print(f"The f1_score of the comparison_model decreased to {comparison_model_f1_score} instead of {best_model_dict['average_f1_score']}")
print(f"The accuracy of the comparison_model decreased to {comparison_model_accuracy} instead of {best_model_dict['evaluate_accuracy']}")

The f1_score of the comparison_model decreased to 0.6177215189873418 instead of 0.7276233128071158
The accuracy of the comparison_model decreased to 0.7417218543046358 instead of 0.8079470198675497


### Save output to csv (optional)
After you're done, save your output to the 'classification_results.csv' file.<br/>
We assume that the dataframe with your results contains the following columns:
* column 1 (left column): 'test_example_id' - the same id associated with each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for each associated story.

Assuming your predicted values are in the `df_predicted` dataframe, you should save your results as follows:

In [None]:
df_predicted.to_csv('classification_results.csv',index=False)