# Assignment 3 - Text Analysis
An explanation this assignment could be found in the .pdf explanation document


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, wordnet)</b> (moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Personal Details:

In [1]:
# Details Student 1: Tair Schapira - 210015848

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [233]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# show several prints in one cell. This will allow us to condence every trick in one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.pandas.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [234]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: the WordNet is not a must

#### (optional) Only if you didn't install Wordnet (for Hebrew) use:

In [235]:
# word net installation:

# unmark if you want to use and need to install
# !pip install wn
#!python -m wn download omw-he:1.4

In [236]:
# word net import:

# unmark if you want to use:
import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: the hebrew_tokenizer is not a must

#### (optional) Only if you didn't install hebrew_tokenizer use:

In [237]:
# Hebrew tokenizer installation:

# unmark if you want to use and need to install:
!pip install hebrew_tokenizer



In [238]:
# Hebrew tokenizer import:

# unmark if you want to use:
import hebrew_tokenizer as ht

### Reading input files
Reading input files for train annotated corpus (raw text data) corpus and for the test corpus

In [239]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [240]:
df_train.head(5)
df_train.shape

Unnamed: 0,story,gender
0,"כשחבר הזמין אותי לחול, לא באמת חשבתי שזה יקרה,...",m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,"כשהייתי ילד, מטוסים היה הדבר שהכי ריתק אותי. ב...",m
4,‏הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f


(753, 2)

In [242]:
df_test.head(3)
df_test.shape

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...


(323, 2)

### Your implementation:
Write your code solution in the following code-cells

# Pre-Processing of the Text



In [243]:
#the following function will clean the 'story' column of the files
#it removes punctuation and other symbols
#it also checks for any english words - since the text is in hebrew, english is unecessary
#makes sure the gender column has correct values ('m' or 'f')

def clean_text_data(df):
    #remove punctuation, symbols, digits etc. from story column and replace with a space
    df['story'] = df['story'].apply(lambda x: re.sub(r'\W', ' ', str(x)))
    df['story'] = df['story'].apply(lambda x: re.sub(r'\d', ' ', str(x)))
    
    #removal of any single lettered characters - a character surrounded by spaces
    df['story'] = df['story'].apply(lambda x: re.sub(r'\s\w\s', ' ', str(x)))
    
    #checking for any english letters/words and removing them
    df['story'] = df['story'].apply(lambda x: x.lower())
    df['story'] = df['story'].apply(lambda x: re.sub(r'\b\w+[a-zA-Z]\w*\b', ' ', str(x)))
    
    #makes sure the gender column contains only m or f - if not, the row is removed from the df
    if 'gender' in df.columns:
        df = df[(df['gender'] == 'm') | (df['gender'] == 'f')]
    return df

In [244]:
#function that tokenizes the stories from the data and returns a list of words belonging to the HEBREW group

def tokenizer(story):
    hebrew_tokens = []
    tokens = ht.tokenize(story)
    for token in tokens:
        if token[0] == 'HEBREW':
            hebrew_tokens.append(token[1].lower())
    return hebrew_tokens


In [254]:
clean_text_data(df_train)
clean_text_data(df_test)

#checking that the tokenizer function works properly
hebrew_tokens = tokenizer('היום בו דיווחתי and על גניבה של האוטו שלי')
print(hebrew_tokens)

X_train = df_train['story']
y_train = df_train['gender']

Unnamed: 0,story,gender
0,כשחבר הזמין אותי לחול לא באמת חשבתי שזה יקרה ...,m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,כשהייתי ילד מטוסים היה הדבר שהכי ריתק אותי ב...,m
4,הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f
...,...,...
748,אז לפני שנה בדיוק טסתי לאמסטרדם עם שני חברים ט...,m
749,שבוע שעבר העליתי באופן ספונטני רעיון לנסוע עם ...,m
750,לפני חודש עברנו לדירה בבית שמש בעקבות משפחתי ה...,m
751,החוויה אותה ארצה לשתף התרחשה לפני כמה חודשים ...,f


Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ...
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...
3,3,רגע הגיוס לצבא היה הרגע הכי משמעותי עבורי אני...
4,4,אני הגעתי לברזיל ישר מקולומביה וגם אני עשיתי ע...
...,...,...
318,318,בשנה האחרונה הרגשתי די תקוע בעבודה השגרה הפכה...
319,319,אני ואילן חברים טובים מזה שנה תמיד חלמנו לפ...
320,320,מידי יום שישי אני נוהג לנסוע בתחבורה ציבורית ס...
321,321,לפני מספר חודשים בשיא התחלואה של הגל השני עמ...


['היום', 'בו', 'דיווחתי', 'על', 'גניבה', 'של', 'האוטו', 'שלי']


# Vectorization

In [246]:
#creating TF_IDF vectors and Count vectors for the training data
#this data will be used for training and evaluating the machine learning models

def tfidf_and_count_vectors(X_train):
    # TF-IDF vectorizer
    tfidf_vect = TfidfVectorizer(tokenizer=tokenizer, min_df=3, max_features=8000, ngram_range=(1, 3), norm='l2')
    X_tfidf = tfidf_vect.fit_transform(X_train)
    
    # Count vectorizer
    count_vect = CountVectorizer(tokenizer=tokenizer, min_df=3, max_features=8000, ngram_range=(1, 3))
    X_count = count_vect.fit_transform(X_train)

    return X_tfidf, X_count

X_tfidf, X_count = tfidf_and_count_vectors(X_train)
X_test_tfidf, X_test_count = tfidf_and_count_vectors(df_test['story'])


# Cross Validation - F1 Score (Macro)

In the following code block, cross-validation is performed on each model using both TF-IDF vectors and Count Vectors. The cross-validation F1 Score (avg) is being calculated. 

The cross-validation F1 scores provide insights into how well each model performs on the given dataset using two different text feature representations. The F1 score is what helps us understand the model's ability to balance precision and recall, thus giving an indication of the model's overall classification performance. 

In [247]:
models = [
    ('LinearSVC', LinearSVC()),
    ('MLPClassifier', MLPClassifier()),
    ('Perceptron', Perceptron()),
    ('SGDClassifier', SGDClassifier()),
    ('KNeighborsClassifier', KNeighborsClassifier()),
    ('DecisionTreeClassifier', DecisionTreeClassifier())
]

for name, model in models:
    scores_tfidf = cross_val_score(model, X_tfidf, y_train, cv=10, scoring='f1_macro')
    avg_score_tfidf = np.mean(scores_tfidf)
    print(f"{name} - Cross-Validation F1 Score (macro) with TF-IDF: {avg_score_tfidf:.4f}")

for name, model in models:
    scores_count = cross_val_score(model, X_count, y_train, cv=10, scoring='f1_macro')
    avg_score_count = np.mean(scores_count)
    print(f"{name} - Cross-Validation F1 Score (macro) with Count: {avg_score_count:.4f}")

LinearSVC - Cross-Validation F1 Score (macro) with TF-IDF: 0.6024
MLPClassifier - Cross-Validation F1 Score (macro) with TF-IDF: 0.6115
Perceptron - Cross-Validation F1 Score (macro) with TF-IDF: 0.6594
SGDClassifier - Cross-Validation F1 Score (macro) with TF-IDF: 0.6815
KNeighborsClassifier - Cross-Validation F1 Score (macro) with TF-IDF: 0.6063
DecisionTreeClassifier - Cross-Validation F1 Score (macro) with TF-IDF: 0.6257
LinearSVC - Cross-Validation F1 Score (macro) with Count: 0.6666
MLPClassifier - Cross-Validation F1 Score (macro) with Count: 0.5987
Perceptron - Cross-Validation F1 Score (macro) with Count: 0.6744
SGDClassifier - Cross-Validation F1 Score (macro) with Count: 0.6749
KNeighborsClassifier - Cross-Validation F1 Score (macro) with Count: 0.4904
DecisionTreeClassifier - Cross-Validation F1 Score (macro) with Count: 0.6133


# Hyperparamater Tuning

In [248]:
models = [
    ('LinearSVC', LinearSVC(), {'C': [0.1, 1, 10]}),
    ('MLPClassifier', MLPClassifier(), {'hidden_layer_sizes': [(50,), (100,), (50, 50)]}),
    ('Perceptron', Perceptron(), {'alpha': [0.0001, 0.001, 0.01]}),
    ('SGDClassifier', SGDClassifier(), {'alpha': [0.0001, 0.001, 0.01]}),
    ('KNeighborsClassifier', KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 10], 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}),
    ('DecisionTreeClassifier', DecisionTreeClassifier(), {'max_depth': [None, 10, 20]})
]

In [252]:
best_score = 0
best_model = None
best_model_name = ""
best_method = ""
x_test_method = None
for name, model, param_grid in models:
    print(f"Hyperparameter tuning for {name} with Count:")
    grid_search_count_method = GridSearchCV(model, param_grid=param_grid, cv=10, scoring='f1_macro')
    grid_search_count_method.fit(X_count, y_train)
    avg_score_count = grid_search_count_method.best_score_
    if avg_score_count > best_score:
        best_score = avg_score_count
        best_model = grid_search_count_method
        best_model_name = name
        best_method = "count"
        x_test_method = X_test_count
    best_params_count = grid_search_count_method.best_params_
    print(f"{name} - Best Cross-Validation F1 Score (macro) with Count: {avg_score_count:.4f}")
    print(f"Best Hyperparameters: {best_params_count}")
    print("--------------------------")

# Perform hyperparameter tuning for each model using TF-IDF vectors
for name, model, param_grid in models:
    print(f"Hyperparameter tuning for {name} with TF-IDF:")
    grid_search_tfidf_method = GridSearchCV(model, param_grid=param_grid, cv=10, scoring='f1_macro')
    grid_search_tfidf_method.fit(X_tfidf, y_train)
    avg_score_tfidf = grid_search_tfidf_method.best_score_
    if avg_score_tfidf > best_score:
        best_score = avg_score_tfidf
        best_model = grid_search_tfidf_method
        best_model_name = name
        best_method = "tfidf"
        x_test_method = X_test_tfidf
    best_params_tfidf = grid_search_tfidf_method.best_params_
    print(f"{name} - Best Cross-Validation F1 Score (macro) with TF-IDF: {avg_score_tfidf:.4f}")
    print(f"Best Hyperparameters: {best_params_tfidf}")
    print("--------------------------")
print("Best score is {}, best model is {} using vectorization method {}".format(best_score, best_model_name, best_method))

Hyperparameter tuning for LinearSVC with Count:


GridSearchCV(cv=10, error_score=nan,
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None, param_grid={'C': [0.1, 1, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

LinearSVC - Best Cross-Validation F1 Score (macro) with Count: 0.6657
Best Hyperparameters: {'C': 1}
--------------------------
Hyperparameter tuning for MLPClassifier with Count:


GridSearchCV(cv=10, error_score=nan,
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=(100,),
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_fun=15000,
                                     max_iter=200, momentum=0.9,
                                     n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=None, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid

MLPClassifier - Best Cross-Validation F1 Score (macro) with Count: 0.6216
Best Hyperparameters: {'hidden_layer_sizes': (50, 50)}
--------------------------
Hyperparameter tuning for Perceptron with Count:


GridSearchCV(cv=10, error_score=nan,
             estimator=Perceptron(alpha=0.0001, class_weight=None,
                                  early_stopping=False, eta0=1.0,
                                  fit_intercept=True, max_iter=1000,
                                  n_iter_no_change=5, n_jobs=None, penalty=None,
                                  random_state=0, shuffle=True, tol=0.001,
                                  validation_fraction=0.1, verbose=0,
                                  warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': [0.0001, 0.001, 0.01]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

Perceptron - Best Cross-Validation F1 Score (macro) with Count: 0.6744
Best Hyperparameters: {'alpha': 0.0001}
--------------------------
Hyperparameter tuning for SGDClassifier with Count:


GridSearchCV(cv=10, error_score=nan,
             estimator=SGDClassifier(alpha=0.0001, average=False,
                                     class_weight=None, early_stopping=False,
                                     epsilon=0.1, eta0=0.0, fit_intercept=True,
                                     l1_ratio=0.15, learning_rate='optimal',
                                     loss='hinge', max_iter=1000,
                                     n_iter_no_change=5, n_jobs=None,
                                     penalty='l2', power_t=0.5,
                                     random_state=None, shuffle=True, tol=0.001,
                                     validation_fraction=0.1, verbose=0,
                                     warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': [0.0001, 0.001, 0.01]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

SGDClassifier - Best Cross-Validation F1 Score (macro) with Count: 0.6983
Best Hyperparameters: {'alpha': 0.01}
--------------------------
Hyperparameter tuning for KNeighborsClassifier with Count:


GridSearchCV(cv=10, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
                         'n_neighbors': [3, 5, 7, 10],
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

KNeighborsClassifier - Best Cross-Validation F1 Score (macro) with Count: 0.5404
Best Hyperparameters: {'algorithm': 'auto', 'n_neighbors': 3, 'weights': 'distance'}
--------------------------
Hyperparameter tuning for DecisionTreeClassifier with Count:


GridSearchCV(cv=10, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': [None, 10, 20]}, pre_dispatch='2*n_jobs

DecisionTreeClassifier - Best Cross-Validation F1 Score (macro) with Count: 0.6268
Best Hyperparameters: {'max_depth': 20}
--------------------------
Hyperparameter tuning for LinearSVC with TF-IDF:


GridSearchCV(cv=10, error_score=nan,
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None, param_grid={'C': [0.1, 1, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

LinearSVC - Best Cross-Validation F1 Score (macro) with TF-IDF: 0.6613
Best Hyperparameters: {'C': 10}
--------------------------
Hyperparameter tuning for MLPClassifier with TF-IDF:


GridSearchCV(cv=10, error_score=nan,
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=(100,),
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_fun=15000,
                                     max_iter=200, momentum=0.9,
                                     n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=None, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid

MLPClassifier - Best Cross-Validation F1 Score (macro) with TF-IDF: 0.6167
Best Hyperparameters: {'hidden_layer_sizes': (100,)}
--------------------------
Hyperparameter tuning for Perceptron with TF-IDF:


GridSearchCV(cv=10, error_score=nan,
             estimator=Perceptron(alpha=0.0001, class_weight=None,
                                  early_stopping=False, eta0=1.0,
                                  fit_intercept=True, max_iter=1000,
                                  n_iter_no_change=5, n_jobs=None, penalty=None,
                                  random_state=0, shuffle=True, tol=0.001,
                                  validation_fraction=0.1, verbose=0,
                                  warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': [0.0001, 0.001, 0.01]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

Perceptron - Best Cross-Validation F1 Score (macro) with TF-IDF: 0.6594
Best Hyperparameters: {'alpha': 0.0001}
--------------------------
Hyperparameter tuning for SGDClassifier with TF-IDF:


GridSearchCV(cv=10, error_score=nan,
             estimator=SGDClassifier(alpha=0.0001, average=False,
                                     class_weight=None, early_stopping=False,
                                     epsilon=0.1, eta0=0.0, fit_intercept=True,
                                     l1_ratio=0.15, learning_rate='optimal',
                                     loss='hinge', max_iter=1000,
                                     n_iter_no_change=5, n_jobs=None,
                                     penalty='l2', power_t=0.5,
                                     random_state=None, shuffle=True, tol=0.001,
                                     validation_fraction=0.1, verbose=0,
                                     warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': [0.0001, 0.001, 0.01]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

SGDClassifier - Best Cross-Validation F1 Score (macro) with TF-IDF: 0.6764
Best Hyperparameters: {'alpha': 0.0001}
--------------------------
Hyperparameter tuning for KNeighborsClassifier with TF-IDF:


GridSearchCV(cv=10, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
                         'n_neighbors': [3, 5, 7, 10],
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

KNeighborsClassifier - Best Cross-Validation F1 Score (macro) with TF-IDF: 0.6170
Best Hyperparameters: {'algorithm': 'auto', 'n_neighbors': 5, 'weights': 'distance'}
--------------------------
Hyperparameter tuning for DecisionTreeClassifier with TF-IDF:


GridSearchCV(cv=10, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': [None, 10, 20]}, pre_dispatch='2*n_jobs

DecisionTreeClassifier - Best Cross-Validation F1 Score (macro) with TF-IDF: 0.6061
Best Hyperparameters: {'max_depth': 10}
--------------------------
Best score is 0.6982959327898952, best model is SGDClassifier using vectorization method count


# Final Predictions

In [262]:
first_five = df_test[:5]
first_five_predictions = best_model.predict(x_test_method[:5])
first_five.insert(2, 'predicted_category', first_five_predictions)
print("First 5 predictions")
first_five

last_five = df_test[-5:]
last_five_predictions = best_model.predict(x_test_method[-5:])
last_five.insert(2, 'predicted_category', last_five_predictions)
print("\nLast 5 predictions")
last_five

First 5 predictions


Unnamed: 0,test_example_id,story,predicted_category
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...,m
1,1,הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ...,m
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...,m
3,3,רגע הגיוס לצבא היה הרגע הכי משמעותי עבורי אני...,m
4,4,אני הגעתי לברזיל ישר מקולומביה וגם אני עשיתי ע...,m



Last 5 predictions


Unnamed: 0,test_example_id,story,predicted_category
318,318,בשנה האחרונה הרגשתי די תקוע בעבודה השגרה הפכה...,m
319,319,אני ואילן חברים טובים מזה שנה תמיד חלמנו לפ...,m
320,320,מידי יום שישי אני נוהג לנסוע בתחבורה ציבורית ס...,f
321,321,לפני מספר חודשים בשיא התחלואה של הגל השני עמ...,m
322,322,היום בו דיווחתי על גניבה של האוטו שלי בוקר אח...,f


In [259]:
all_test_predictions = best_model.predict(x_test_method)
df_predicted = pd.DataFrame({
    'test_example_id': df_test['test_example_id'],
    'predicted_category': all_test_predictions
})
df_predicted

Unnamed: 0,test_example_id,predicted_category
0,0,m
1,1,m
2,2,m
3,3,m
4,4,m
...,...,...
318,318,m
319,319,m
320,320,f
321,321,m


### Save output to csv (optional)
After you're done save your output to the 'classification_results.csv' csv file.<br/>
We assume that the dataframe with your results contain the following columns:
* column 1 (left column): 'test_example_id'  - the same id associated to each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for each of the associated story. 

Assuming your predicted values are in the `df_predicted` dataframe, you should save you're results as following:

In [263]:
df_predicted.to_csv('classification_results.csv',index=False)