# Assignment 3 - Text Analysis
An explanation of this assignment can be found in the accompanying .pdf document.


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, wordnet)</b> (moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Personal Details:

In [None]:
# Details Student 1: Bar Elimelech

# Details Student 2: Nitzan Kohan

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [None]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# show the output of every expression in a cell, not only the last one, so several results can be condensed into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [None]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: WordNet is optional.

#### (optional) Only if you have not installed WordNet (for Hebrew) yet, use:

In [None]:
# word net installation:

# uncomment if you want to use it and need to install it:
# !pip install wn
# !python -m wn download omw-he:1.4

In [None]:
# word net import:

# uncomment if you want to use it:
# import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: hebrew_tokenizer is optional.

#### (optional) Only if you have not installed hebrew_tokenizer yet, use:

In [None]:
# Hebrew tokenizer installation:

# uncomment if you want to use it and need to install it:
# !pip install hebrew_tokenizer

In [None]:
# Hebrew tokenizer import:

# uncomment if you want to use it:
# import hebrew_tokenizer as ht

### Reading input files
Read the annotated training corpus (raw text data) and the test corpus.

In [None]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [None]:
df_train.head(8)
df_train.shape

In [None]:
df_test.head(3)
df_test.shape

### Your implementation:
Write your code solution in the following code-cells

### Clean dataset

In [None]:
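# Matches any run of characters that are neither Hebrew letters (U+05D0-U+05EA) nor whitespace.
# Compiled once at module level instead of on every call.
NON_HEBREW_PATTERN = re.compile(r'[^\u05D0-\u05EA\s]+')

def remove_non_heb(story):
    """Strip everything except Hebrew letters and whitespace from a story."""
    return NON_HEBREW_PATTERN.sub('', story)

def clean_dataset(df):
    """Return a copy without missing rows or duplicate stories, keeping Hebrew text only."""
    clean_df = df.copy()
    clean_df.dropna(inplace=True)
    clean_df.drop_duplicates(subset=["story"], inplace=True)
    clean_df['story'] = clean_df['story'].apply(remove_non_heb)

    return clean_df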
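
To see what the character class used by `remove_non_heb` keeps and drops, here is a minimal sketch on a made-up mixed string. The sample text, the helper name `strip_non_hebrew`, and the final whitespace-collapsing step are assumptions added for illustration; the notebook itself does not collapse whitespace.

```python
import re

# Same character class as remove_non_heb: keep Hebrew letters (U+05D0..U+05EA) and whitespace.
HEBREW_ONLY = re.compile(r'[^\u05D0-\u05EA\s]+')

def strip_non_hebrew(raw):
    """Hypothetical helper: drop every run that is neither a Hebrew letter nor whitespace."""
    return HEBREW_ONLY.sub('', raw)

shalom = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"
olam = "\u05e2\u05d5\u05dc\u05dd"    # the Hebrew word "olam"

# Made-up sample mixing Hebrew, Latin letters, digits and punctuation.
sample = shalom + ", world! 123 " + olam + "."

cleaned = strip_non_hebrew(sample)
# Removed tokens leave their surrounding spaces behind, so runs of spaces remain.
# Collapsing them is an extra, assumed step that the notebook does not perform.
normalized = " ".join(cleaned.split())

print(repr(cleaned))
print(repr(normalized))
```

Because whitespace is preserved, word boundaries survive the cleaning, which is what the vectorizers rely on later.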

In [None]:
clean_df = clean_dataset(df_train)

In [None]:
clean_df.head(15)

In [None]:
X = clean_df['story']
y = clean_df['gender']

### Create vectorizers

In [None]:
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(X)

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(X)
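
As a rough illustration of how the two vectorizers differ, the sketch below runs both on a tiny made-up English corpus (the documents are hypothetical, not taken from the assignment data). `CountVectorizer` stores raw term counts, while `TfidfVectorizer` down-weights terms that appear in many documents and, with its default settings, scales each row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = ["red apple red", "green apple"]  # hypothetical documents

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_corpus).toarray()
# vocabulary_ maps each term to its column index; columns are in alphabetical order.
vocab = sorted(count_vec.vocabulary_)
print(vocab)
print(counts)

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_corpus).toarray()
# With the default norm='l2', every document row has length 1.
row_norms = np.linalg.norm(tfidf, axis=1)

# 'apple' occurs in both documents, so its IDF is lower than that of 'red'.
idf_apple = tfidf_vec.idf_[tfidf_vec.vocabulary_["apple"]]
idf_red = tfidf_vec.idf_[tfidf_vec.vocabulary_["red"]]
print(row_norms, idf_apple, idf_red)
```

Note that the default `token_pattern` of both vectorizers only keeps tokens of two or more word characters, so single-letter tokens are dropped unless the pattern is changed.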

### Split train/test for both vectorizer matrices

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    count_matrix, 
    y, 
    test_size=0.2, 
    random_state=42
)

In [None]:
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(
    tfidf_matrix, 
    y, 
    test_size=0.2, 
    random_state=42
)
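
Both vectorizers above are fitted on the full corpus before the split, so the vocabulary and IDF weights are learned partly from the held-out stories. One possible alternative, sketched here under assumptions, is to split the raw text first and put the vectorizer inside a `Pipeline`, so that it is refitted on the training part of every cross-validation fold. The toy texts and labels below are hypothetical stand-ins for `clean_df['story']` and `clean_df['gender']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: two classes with disjoint vocabularies, repeated three times.
texts = ["aa bb cc", "aa bb dd", "aa cc dd", "bb cc aa",
         "xx yy zz", "xx yy ww", "xx zz ww", "yy zz xx"] * 3
labels = (["m"] * 4 + ["f"] * 4) * 3

# Split the raw text, not a pre-computed matrix.
raw_train, raw_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is a pipeline step, so each CV fold fits it on its own training part only.
text_clf = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

search = GridSearchCV(text_clf, {"clf__C": [0.1, 1, 10]}, scoring="f1_macro", cv=3)
search.fit(raw_train, lab_train)

test_pred = search.predict(raw_test)  # raw strings go in; the pipeline vectorizes them
print(search.best_params_, list(test_pred))
```

The same pattern also removes the need to keep a separate fitted vectorizer around for the final test corpus, since `predict` accepts raw strings.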

### Define classifier params

In [None]:
dt_param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

knn_param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

svc_param_grid = {
    'C': [0.1, 1, 10]
}

lr_param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l1', 'l2']
}

### Init classifiers

In [None]:
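dt_classifier = DecisionTreeClassifier()
knn_classifier = KNeighborsClassifier()
svc_classifier = LinearSVC()
# liblinear supports both the 'l1' and 'l2' penalties in lr_param_grid (the default lbfgs solver rejects 'l1')
lr_classifier = linear_model.LogisticRegression(solver='liblinear')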

### Grid search

In [None]:
dt_grid_search = GridSearchCV(estimator=dt_classifier, param_grid=dt_param_grid, scoring=metrics.make_scorer(f1_score, average='macro'), cv=10, n_jobs=-1)
knn_grid_search = GridSearchCV(estimator=knn_classifier, param_grid=knn_param_grid, scoring=metrics.make_scorer(f1_score, average='macro'), cv=10, n_jobs=-1)
svc_grid_search = GridSearchCV(estimator=svc_classifier, param_grid=svc_param_grid, scoring=metrics.make_scorer(f1_score, average='macro'), cv=10, n_jobs=-1)
lr_grid_search = GridSearchCV(estimator=lr_classifier, param_grid=lr_param_grid, scoring=metrics.make_scorer(f1_score, average='macro'), cv=10, n_jobs=-1)

### Fit - CountVectorizer

In [None]:
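# GridSearchCV.fit returns the search object itself, so clone each search here;
# otherwise the TF-IDF fit below would overwrite these results.
from sklearn.base import clone
counter_dt_res = clone(dt_grid_search).fit(X_train, y_train)
counter_knn_res = clone(knn_grid_search).fit(X_train, y_train)
counter_svc_res = clone(svc_grid_search).fit(X_train, y_train)
counter_lr_res = clone(lr_grid_search).fit(X_train, y_train)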

### Fit - TfidfVectorizer

In [None]:
tfidf_dt_res = dt_grid_search.fit(X_train_tfidf, y_train_tfidf)
tfidf_knn_res = knn_grid_search.fit(X_train_tfidf, y_train_tfidf)
tfidf_svc_res = svc_grid_search.fit(X_train_tfidf, y_train_tfidf)
tfidf_lr_res = lr_grid_search.fit(X_train_tfidf, y_train_tfidf)
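
A point worth keeping in mind when one search object is fitted on two feature matrices: `fit` returns the object itself, so two result names bound to consecutive fits refer to the same search, and only the last fit survives. The sketch below illustrates this on synthetic data (the datasets and variable names are assumptions for the demonstration) and shows `sklearn.base.clone` as one way to keep independent results.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Two synthetic datasets with different numbers of features.
X_a, y_a = make_classification(n_samples=60, n_features=5, random_state=0)
X_b, y_b = make_classification(n_samples=60, n_features=8, random_state=1)

template = GridSearchCV(LinearSVC(), {"C": [0.1, 1, 10]}, cv=3)

# Reusing the same object: both names point at one search, fitted last on X_b.
shared_a = template.fit(X_a, y_a)
shared_b = template.fit(X_b, y_b)
same_object = shared_a is shared_b
shared_width = shared_a.best_estimator_.coef_.shape[1]  # width of X_b, not X_a

# Cloning copies the configuration only, so each fit yields an independent result.
res_a = clone(template).fit(X_a, y_a)
res_b = clone(template).fit(X_b, y_b)
width_a = res_a.best_estimator_.coef_.shape[1]
width_b = res_b.best_estimator_.coef_.shape[1]

print(same_object, shared_width, width_a, width_b)
```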

### Compare results

In [None]:
tfidf_grid_list = [tfidf_dt_res, tfidf_knn_res, tfidf_svc_res, tfidf_lr_res]
counter_grid_list = [counter_dt_res, counter_knn_res, counter_svc_res, counter_lr_res]

print("TfidfVectorizer RESULTS:")
for grid in tfidf_grid_list:
    print(f"\nESTIMATOR: {grid.estimator}")
    print(f"- BEST PARAMS: {grid.best_params_}")
    print(f"- BEST SCORE: {grid.best_score_}")
    print(f"- PREDICTION SCORE: {f1_score(y_test_tfidf, grid.predict(X_test_tfidf), average='macro')}")
    
print("\nCountVectorizer RESULTS:")
for grid in counter_grid_list:
    print(f"\nESTIMATOR: {grid.estimator}")
    print(f"- BEST PARAMS: {grid.best_params_}")
    print(f"- BEST SCORE: {grid.best_score_}")
    print(f"- PREDICTION SCORE: {f1_score(y_test, grid.predict(X_test), average='macro')}")
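
The scores above are macro-averaged F1 values: F1 is computed per class and the per-class values are averaged with equal weight, regardless of class size. A minimal worked example on made-up labels (the label values and the helper `f1_for` are hypothetical) compares a manual computation with `f1_score`:

```python
from sklearn.metrics import f1_score

# Made-up ground truth and predictions for two classes.
y_true_demo = ["m", "m", "m", "m", "f", "f"]
y_pred_demo = ["m", "m", "m", "f", "f", "m"]

def f1_for(label, y_true, y_pred):
    """Hypothetical helper: F1 of one class from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)

f1_m = f1_for("m", y_true_demo, y_pred_demo)  # tp=3, fp=1, fn=1
f1_f = f1_for("f", y_true_demo, y_pred_demo)  # tp=1, fp=1, fn=1
manual_macro = (f1_m + f1_f) / 2

sklearn_macro = f1_score(y_true_demo, y_pred_demo, average="macro")
print(f1_m, f1_f, manual_macro, sklearn_macro)
```

Because both classes count equally, a classifier that does well only on the larger class is penalized more than it would be under plain accuracy.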

### Best results

In [None]:
df_test.head()

In [None]:
# Init LinearSVC with the best param found above - {C: 10}
best_svc_classifier = LinearSVC(C=10)
best_svc_classifier.fit(X_train, y_train)

# Remove non-Hebrew characters from the df_test stories
df_test['story'] = df_test['story'].apply(remove_non_heb)
new_count_matrix = count_vectorizer.transform(df_test['story'])
# Predict with the classifier fitted above (svc_classifier itself was never fitted)
best_y_pred = best_svc_classifier.predict(new_count_matrix)

In [None]:
df_predicted = pd.DataFrame({'test_example_id': df_test['test_example_id'], 'predicted_category':best_y_pred})

In [None]:
display(df_predicted.head(5), df_predicted.tail(5))

### Save output to csv (optional)
After you're done, save your output to the 'classification_results.csv' file.<br/>
We assume that the dataframe with your results contains the following columns:
* column 1 (left column): 'test_example_id' - the same id associated with each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for each associated story.

Assuming your predicted values are in the `df_predicted` dataframe, you should save your results as follows:

In [None]:
df_predicted.to_csv('classification_results.csv',index=False)
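
A quick way to confirm that a saved file matches the expected two-column layout is to read it back and check it. The sketch below does this with a small hypothetical dataframe written to a temporary directory, so it does not touch the real results file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_predicted.
df_example = pd.DataFrame({
    "test_example_id": [0, 1, 2],
    "predicted_category": ["m", "f", "m"],
})

# Write to a temporary location (assumed path) rather than the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "classification_results.csv")
df_example.to_csv(out_path, index=False)

# Read the file back and check column order, row count and missing predictions.
df_check = pd.read_csv(out_path)
columns_ok = list(df_check.columns) == ["test_example_id", "predicted_category"]
rows_ok = len(df_check) == len(df_example)
no_missing = not df_check["predicted_category"].isna().any()

print(columns_ok, rows_ok, no_missing)
```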