# Assignment 5 - Text Analysis
An explanation of this assignment can be found in the .pdf explanation document.


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following Python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, wordnet)</b> (moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [1]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, make_scorer
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.feature_selection import SelectKBest, chi2 ,mutual_info_classif
from sklearn.model_selection import  cross_validate ,RepeatedKFold , RepeatedStratifiedKFold
from sklearn.ensemble import StackingClassifier , BaggingClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC 
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# show the output of every expression in a cell, not only the last one. This lets us condense several steps into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [2]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: using WordNet is optional

#### (optional) Only if you didn't install Wordnet (for Hebrew) use:

In [3]:
# word net installation:

# uncomment if you want to use it and need to install it
# !pip install wn
# !python -m wn download omw-he:1.4

In [4]:
# word net import:

# uncomment if you want to use it:
# import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: using hebrew_tokenizer is optional

#### (optional) Only if you didn't install hebrew_tokenizer use:

In [5]:
# Hebrew tokenizer installation:

# uncomment if you want to use it and need to install it:
# !pip install hebrew_tokenizer

In [6]:
# Hebrew tokenizer import:

# uncomment if you want to use it:
# import hebrew_tokenizer as ht

### Reading input files
Read the annotated training corpus (raw text data) and the test corpus from their CSV files.

In [7]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [8]:
df_train.head(8)
df_train.shape

Unnamed: 0,story,gender
0,"כשחבר הזמין אותי לחול, לא באמת חשבתי שזה יקרה,...",m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,"כשהייתי ילד, מטוסים היה הדבר שהכי ריתק אותי. ב...",m
4,‏הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f
5,לפני כ3 חודשים טסתי לרומא למשך שבוע. טסתי במטו...,f
6,אני כבר שנתיים נשוי והשנה אני ואישתי סוף סוף י...,m
7,השנה התחלנו שיפוץ בדירה שלנו בתל אביב. הדירה ה...,f


(753, 2)

In [9]:
df_test.head(3)
df_test.shape

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...


(323, 2)

### Your implementation:
Write your code solution in the following code-cells

Student: Ofek Atun

ID: 316063015

Email: Ofek.Atun14@gmail.com

In [10]:
# Pattern matching library
import re

# This function takes a dataframe with a column named "story" and removes unneeded characters from each story.
def clean_text_df(dataframe):
    for index, row in dataframe.iterrows():
        story = row["story"]
        # Remove numbers
        story = re.sub(r'\d+', '', story)
        # Remove punctuation and symbols (anything that is not a word character or whitespace)
        story = re.sub(r'[^\w\s]', '', story)
        # Remove extra spaces
        story = re.sub(r'\s+', ' ', story)
        # Strip leading and trailing spaces
        story = story.strip()
        # Update the cleaned story in the DataFrame
        dataframe.at[index, "story"] = story
    
    return dataframe

df_train = clean_text_df(df_train)
df_test = clean_text_df(df_test)
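
The four cleaning steps can also be tried on single strings before running them over the whole corpus. The sketch below is only an illustration: `clean_story` and the two-row `demo` frame are hypothetical names that do not appear in the notebook. It relies on the fact that `\w` in Python 3 matches Unicode letters, so Hebrew text survives the punctuation step.

```python
import re
import pandas as pd

def clean_story(text: str) -> str:
    """Apply the same four cleaning steps to a single string."""
    text = re.sub(r'\d+', '', text)       # drop digits
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation (letters, incl. Hebrew, and '_' stay)
    text = re.sub(r'\s+', ' ', text)      # collapse runs of whitespace
    return text.strip()

# Hypothetical two-row frame standing in for df_train / df_test
demo = pd.DataFrame({"story": ["Hello,   world 123!", "שלום, עולם 2023!"]})

# Series.map applies the function row by row without an explicit iterrows loop
demo["story"] = demo["story"].map(clean_story)
print(demo["story"].tolist())
```

Mapping a function over the column gives the same result as the row loop and is usually shorter to read.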

Split the data into train and test sets for our machine learning models


In [11]:
def split_train_test_data(dataframe):
    # Split the dataframe into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(dataframe["story"], dataframe["gender"], test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

# Split the training dataframe into train and test sets
X_train, X_test, y_train, y_test = split_train_test_data(df_train)
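
If the gender labels are imbalanced, a plain random split can leave the two parts with different class ratios. One possible variation, not used in the notebook, is the `stratify` argument of `train_test_split`. The toy frame below is made up for the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy frame: 8 'm' stories and 2 'f' stories
toy = pd.DataFrame({
    "story": [f"story {i}" for i in range(10)],
    "gender": ["m"] * 8 + ["f"] * 2,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["story"], toy["gender"],
    test_size=0.5, random_state=42,
    stratify=toy["gender"],   # keep the m/f ratio the same in both parts
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With stratification each half of this toy frame receives exactly one `'f'` story.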



TF-IDF Vectorization for Text Data


In [12]:
def create_tfidf_vectors(X_train, X_test, ngram_range=(1, 1), min_document_frequency=5):
    tfidf_vectorizer = TfidfVectorizer(min_df=min_document_frequency, ngram_range=ngram_range)
    
    # Convert the training set into TF-IDF 
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

    # Convert the test set into TF-IDF 
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    return X_train_tfidf, X_test_tfidf
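
The difference between `fit_transform` and `transform` is what keeps the training and test matrices compatible. A minimal sketch on made-up English sentences (the notebook works on Hebrew stories and uses `min_df=5`, here `min_df=1` is assumed so the tiny corpus is not filtered away):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog sat", "the cat ran"]    # hypothetical training texts
new_docs = ["the bird sat", "a completely unseen sentence"]   # hypothetical unseen texts

vectorizer = TfidfVectorizer(min_df=1)
X_train_demo = vectorizer.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_new_demo = vectorizer.transform(new_docs)          # reuses them; unknown words are ignored

print(sorted(vectorizer.vocabulary_))
print(X_train_demo.shape, X_new_demo.shape)
```

Both matrices have one column per word learned from the training texts, and a document made only of unknown words becomes an all-zero row.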


---------------------------------------------------------------------------------------------

Feature Selection with Mutual Information

In [13]:
def select_best_features(X_train, X_test, y_train, k=800):
    feature_selector = SelectKBest(mutual_info_classif, k=k)
    
    # Fit the feature selector using the training set and target variable
    feature_selector.fit(X_train, y_train)
    
    # Transform the training set to retain only the selected features
    X_train_selected = feature_selector.transform(X_train)
    
    # Transform the test set to retain only the selected features
    X_test_selected = feature_selector.transform(X_test)
    
    return X_train_selected, X_test_selected
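
To see what `SelectKBest` with `mutual_info_classif` keeps, it can help to run it on a tiny matrix where the answer is obvious. The data below is invented for the illustration, and `discrete_features=True` is passed explicitly (an assumption made here so the dense toy matrix is scored the same way a sparse matrix would be by default).

```python
import numpy as np
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical count matrix: column 0 matches the label exactly,
# column 1 is constant, column 2 is unrelated to the label
X_demo = np.array([[1, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0],
                   [0, 1, 1]])
y_demo = np.array(["m", "m", "f", "f"])

score_func = partial(mutual_info_classif, discrete_features=True)
selector = SelectKBest(score_func, k=1).fit(X_demo, y_demo)

print(selector.scores_)         # one mutual-information score per column
print(selector.get_support())   # boolean mask of the columns that are kept
```

Only the column that carries information about the label is kept; the fitted `selector` can then be applied to other matrices with `transform`.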


---------------------------------------------------------------------------------------------

Feature Scaling using Min-Max Scaling

In [14]:
def perform_MinMaxScale(X_train_selected, X_test_selected):
    scaler = MinMaxScaler()
    
    # Scale the training set using MinMaxScaler
    X_train_scaled = scaler.fit_transform(X_train_selected.toarray())
    
    # Scale the test set using MinMaxScaler
    X_test_scaled = scaler.transform(X_test_selected.toarray())
    
    return X_train_scaled, X_test_scaled
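
`MinMaxScaler` does not accept sparse input, which is why the function calls `toarray()` first. A possible alternative, shown here only as a sketch and not what the notebook uses, is `MaxAbsScaler`, which works directly on sparse matrices. For non-negative columns whose minimum is 0 (typical for TF-IDF features) the two scalers give the same values.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# Hypothetical non-negative sparse matrix; every column contains at least one zero
X_sparse = csr_matrix(np.array([[0.0, 0.2, 0.0],
                                [0.5, 0.0, 0.1],
                                [0.25, 0.4, 0.0]]))

dense_scaled = MinMaxScaler().fit_transform(X_sparse.toarray())  # needs a dense array
sparse_scaled = MaxAbsScaler().fit_transform(X_sparse)           # stays sparse

print(np.allclose(dense_scaled, sparse_scaled.toarray()))
```

Keeping the matrix sparse matters little with 800 features, but it saves memory when the vocabulary is large.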


---------------------------------------------------------------------------------------------

Finding the Best Hyper-Parameters using Grid Search

In [16]:
def find_best_HyParams_grid_search(model, parameters, X_train, y_train):
    # Perform grid search to find the best parameters
    grid_search = GridSearchCV(model, parameters, cv=5, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    # Return the best parameters found
    return grid_search.best_params_
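
Grid search can also be run over a whole `Pipeline`, so the vectorizer is refitted inside every cross-validation fold and the score is chosen explicitly. The sketch below is a variation under assumptions, not the notebook's method: the toy corpus is invented, and `scoring="f1_macro"` is picked here to match the averaged per-class F1 computed later in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 6 texts per class
texts = ["i went home alone"] * 6 + ["she drove to work"] * 6
labels = ["m"] * 6 + ["f"] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is a pipeline that accepts raw text, so no separate preprocessing code is needed at prediction time.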

Searching for the best hyper-parameters of each of the following models:

In [17]:
# LinearSVC accepts only 'l1' or 'l2' as penalty, so None is left out of the grid
linear_svc_params = [{'C': [0.01, 0.1, 1, 10, 100],  'penalty': ['l1', 'l2'], 'dual': [False]}]
best_params_linear_svc = find_best_HyParams_grid_search(LinearSVC(), linear_svc_params, X_train_scaled, y_train)
print("Best parameters for LinearSVC:" + str(best_params_linear_svc))

Best parameters for LinearSVC:{'C': 0.1, 'dual': False, 'penalty': 'l2'}


In [18]:
perceptron_params = [{'alpha': [0.0001, 0.05], 'penalty': [None, 'l2', 'l1', 'elasticnet']}]
best_params_perceptron = find_best_HyParams_grid_search(Perceptron(), perceptron_params, X_train_scaled, y_train)
print("Best parameters for Perceptron:" + str(best_params_perceptron))

Best parameters for Perceptron:{'alpha': 0.0001, 'penalty': 'l1'}


In [19]:
multinomialNB_params = [{ 'alpha': [0.01, 0.1, 0.5, 1],  'fit_prior': [True, False],  'class_prior': [None, [0.5, 0.5], [0.3, 0.7]]}]
best_params_multinomialNB = find_best_HyParams_grid_search(MultinomialNB(), multinomialNB_params, X_train_scaled, y_train)
print("Best parameters for MultinomialNB:" + str(best_params_multinomialNB))

Best parameters for MultinomialNB:{'alpha': 1, 'class_prior': [0.3, 0.7], 'fit_prior': True}


In [20]:
# takes a lot of time to run; run only if needed
mlp_params = [
    {
        'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
        'activation': ['tanh', 'relu'],
        'solver': ['sgd', 'adam'],
        'alpha': [0.0001, 0.05],
        'learning_rate': ['constant', 'adaptive']
    }
]

best_params_mlp = find_best_HyParams_grid_search(MLPClassifier(), mlp_params, X_train_scaled, y_train)

print("Best parameters for MLPClassifier:" + str(best_params_mlp))


Best parameters for MLPClassifier:{'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'solver': 'adam'}


In [21]:

# Note: in recent scikit-learn versions the 'log' loss is named 'log_loss'
sgd_params = [{'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
               'penalty': ['l2', 'l1', 'elasticnet'],
               'alpha': [0.0001, 0.05]}]
best_params_sgd = find_best_HyParams_grid_search(SGDClassifier(), sgd_params, X_train_scaled, y_train)

print("Best parameters for SGDClassifier:" + str(best_params_sgd))


Best parameters for SGDClassifier:{'alpha': 0.05, 'loss': 'modified_huber', 'penalty': 'l2'}


---------------------------------------------------------------------------------------------

Model initialization with the best parameters

In [22]:
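# Note: from here on the name linear_model refers to this classifier and no longer to the sklearn.linear_model module imported above
linear_model = LinearSVC(**best_params_linear_svc)  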
perceptron_model = Perceptron(**best_params_perceptron)  
multinomialNB_model = MultinomialNB(**best_params_multinomialNB)  
sgd_model = SGDClassifier(**best_params_sgd) 


In [23]:
# run only if the MLP grid search above was run
mlp_model = MLPClassifier(**best_params_mlp) 

This function trains a model, predicts with it, and evaluates it.

In [24]:
def fit_predict_evaluate(model, X_train, X_test, y_train, y_test):
    # Train the model using the training data
    model_trained = model.fit(X_train, y_train)
    
    # Define the cross-validation strategy
    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
    
    # Define the scoring metric (micro-averaged F1, which equals accuracy for single-label data)
    f1_scorer = make_scorer(f1_score, average='micro')
    
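    # Perform cross-validation and calculate F1 scores.
    # Note: this runs on the held-out split, and cross_val_score refits a fresh clone of the
    # model on each fold, so these scores do not describe model_trained itself.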
    scores = cross_val_score(model_trained, X_test, y_test, scoring=f1_scorer, cv=cv, n_jobs=-1)
    
    # Print the F1 scores obtained during cross-validation
    print(scores)
    
    # Predict the target variable on the test data
    y_pred = model_trained.predict(X_test)
    
    # Print the predicted labels
    print(y_pred)
    
    # Calculate the F1 scores for each class (male and female)
    f1_male = f1_score(y_test, y_pred, pos_label="m")
    f1_female = f1_score(y_test, y_pred, pos_label="f")
    
    # Average the two per-class F1 scores (this is the macro-averaged F1)
    f1_average = (f1_male + f1_female) / 2
    
    # Return the trained model and the average F1 score
    return model_trained, f1_average
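
The two kinds of F1 used in this function measure different things. As a small check on invented labels: averaging the per-class F1 scores by hand gives the same number as `average="macro"`, while `average="micro"` is the same as plain accuracy when each story has exactly one label.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["m", "m", "m", "f", "f", "m"]   # hypothetical true labels
y_pred = ["m", "m", "f", "f", "m", "m"]   # hypothetical predictions

# Per-class F1, averaged by hand as in fit_predict_evaluate
f1_m = f1_score(y_true, y_pred, pos_label="m")
f1_f = f1_score(y_true, y_pred, pos_label="f")
manual_average = (f1_m + f1_f) / 2

macro = f1_score(y_true, y_pred, average="macro")
micro = f1_score(y_true, y_pred, average="micro")

print(manual_average, macro)                   # both 0.625
print(micro, accuracy_score(y_true, y_pred))   # both 4/6
```

Macro averaging weighs both classes equally, so it drops when the rarer class is predicted poorly even if overall accuracy stays high.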


Training, predicting, and evaluating our models.

In [25]:
linear_trained, f1_average_linear = fit_predict_evaluate(linear_model, X_train_scaled, X_test_scaled, y_train, y_test)
print("f1_score Linear_svc: " + str(f1_average_linear))

perceptron_trained, f1_average_perceptron = fit_predict_evaluate(perceptron_model, X_train_scaled, X_test_scaled, y_train, y_test)
print("f1_score perceptron: " + str(f1_average_perceptron))

multinomialNB_trained, f1_average_multinomialNB = fit_predict_evaluate(multinomialNB_model, X_train_scaled, X_test_scaled, y_train, y_test)
print("f1_score MultinomialNB: " + str(f1_average_multinomialNB))

sgd_trained, f1_average_sgd = fit_predict_evaluate(sgd_model, X_train_scaled, X_test_scaled, y_train, y_test)
print("f1_score SGD: " + str(f1_average_sgd))


[0.74193548 0.56666667 0.83333333 0.83333333 0.83333333 0.87096774
 0.86666667 0.8        0.7        0.7        0.74193548 0.63333333
 0.7        0.93333333 0.8       ]
['m' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'm' 'm' 'm' 'm'
 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f'
 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'm']
f1_score Linear_svc: 0.6864303616183316
[0.74193548 0.6        0.86666667 0.73333333 0.83333333 0.83870968
 0.83333333 0.76666667 0.76666667 0.7        0.74193548 0.66666667
 0.76666667 0.9        0.76666667]
['m' 'm'

In [26]:
# run only if the MLP model was created
mlp_trained, f1_average_mlp = fit_predict_evaluate(mlp_model, X_train_scaled, X_test_scaled, y_train, y_test)
print("f1_score MLP:" + str(f1_average_mlp))



[0.70967742 0.56666667 0.83333333 0.86666667 0.8        0.77419355
 0.83333333 0.76666667 0.7        0.7        0.80645161 0.6
 0.7        0.93333333 0.8       ]
['m' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'm' 'm' 'f' 'f' 'm' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm'
 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'm' 'm'
 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'f' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'm'
 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'f' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm'
 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'f' 'm' 'm' 'm'
 'm' 'f' 'm' 'm' 'm' 'm' 'm']
f1_score MLP:0.6880165289256198


Function to create TF-IDF vectors from text data

---------------------------------------------------------------------------------------------

In [29]:

def create_tfidf_vectors(X, ngram_range=(1, 1), min_df=5):
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
    
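    # Fit a new vectorizer on X and convert X into TF-IDF vectors.
    # Note: the vocabulary learned here (and so the columns) differs from the one fitted on the training set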
    X_tfidf = tfidf_vectorizer.fit_transform(X)
    
    # Return the TF-IDF vectors
    return X_tfidf


Function to select the best features using Mutual Information

In [30]:

def select_best_features_test(X, k=800):
    selector = SelectKBest(mutual_info_classif, k=k)
    
    # Perform feature selection using Mutual Information
    # Caution: mutual information is measured against the target, so the all-zeros placeholder
    # below gives every feature a score of 0 and the k "selected" columns are arbitrary
    X_selected = selector.fit_transform(X, np.zeros(X.shape[0]))
    
    # Return the selected features
    return X_selected
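
Mutual information is always computed between a feature and the target, so a constant placeholder target cannot rank features. The sketch below uses invented binary features to show the effect; `discrete_features=True` is set explicitly as an assumption for the dense toy matrix.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

X_demo = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])   # hypothetical binary features
real_labels = np.array([1, 1, 0, 0])                  # column 0 matches these labels
placeholder = np.zeros(X_demo.shape[0])               # constant dummy target

scores_real = mutual_info_classif(X_demo, real_labels, discrete_features=True)
scores_placeholder = mutual_info_classif(X_demo, placeholder, discrete_features=True)

print(scores_real)          # column 0 scores high, column 1 scores 0
print(scores_placeholder)   # every column scores 0
```

With all scores tied at zero, `SelectKBest` has no basis for choosing, which is one reason to reuse the selector that was fitted on the labelled training data.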


Function to perform Min-Max scaling on feature vectors

In [31]:

def perform_MinMaxScale_test(X):
    scaler = MinMaxScaler()
    
    # Perform Min-Max scaling on the feature vectors
    # The `toarray()` method is used to convert sparse matrices to dense arrays
    X_scaled = scaler.fit_transform(X.toarray())
    
    # Return the scaled feature vectors
    return X_scaled


Transforming and predicting on test data

In [33]:

X_df_test = df_test["story"]  # Extracting the stories from df_test
X_df_test = create_tfidf_vectors(X_df_test)  # Applying TF-IDF vectorization to X_df_test
X_df_test = select_best_features_test(X_df_test)  # Selecting the k best features from X_df_test
X_df_test = perform_MinMaxScale_test(X_df_test)  # Performing min-max scaling on X_df_test

# Predicting the categories on df_test using the trained model
y_prediction_test = sgd_trained.predict(X_df_test)

# Extracting the text example IDs from df_test
df_text_example = df_test.test_example_id


ValueError: X has 1000 features, but SGDClassifier is expecting 800 features as input.
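
The error arises because the test stories pass through a vectorizer, selector, and scaler that were fitted separately from the ones the classifier was trained with, so the columns no longer line up. One way to avoid this is to keep all steps in a single `Pipeline` that is fitted once on the training data and then reused. The following is a minimal sketch under assumptions, not the notebook's code: the texts are invented, `k=4` is chosen for the tiny corpus, and `MaxAbsScaler` is swapped in for `MinMaxScaler` because it accepts sparse input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical labelled training texts and unlabelled test texts
train_texts = ["he fixed the car", "he played football",
               "she baked a cake", "she painted flowers"] * 3
train_labels = ["m", "m", "f", "f"] * 3
test_texts = ["he fixed the bike", "she baked bread and painted"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=1)),                # vocabulary learned from train only
    ("select", SelectKBest(mutual_info_classif, k=4)),   # selection uses the real labels
    ("scale", MaxAbsScaler()),                           # sparse-friendly scaling
    ("clf", SGDClassifier(random_state=42)),
])

pipe.fit(train_texts, train_labels)   # every step is fitted once, on the training data
preds = pipe.predict(test_texts)      # the same fitted steps transform the test texts

print(preds)
print(pipe.named_steps["clf"].coef_.shape)   # the classifier sees exactly k features
```

In the notebook the same idea would mean fitting on the cleaned training stories and calling `predict` on `df_test["story"]`, with the step parameters taken from the grid search results.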

### Save output to csv (optional)
After you're done, save your output to the 'classification_results.csv' csv file.<br/>
We assume that the dataframe with your results contains the following columns:
* column 1 (left column): 'test_example_id'  - the same id associated with each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for each associated story. 

Assuming your predicted values are in the `df_predicted` dataframe, you should save your results as follows:

In [None]:
# Creating a dataframe with the predicted categories and their corresponding text example IDs
df_predicted = pd.DataFrame({
    "test_example_id": df_text_example.tolist(),
    "predicted_category": y_prediction_test.tolist()
})  

df_predicted.to_csv('classification_results.csv',index=False)