# Hybrid Core Technical Task

### Import libraries and load the data

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score, accuracy_score

from sklearn.feature_extraction.text import TfidfVectorizer

import pickle

In [3]:
df=pd.read_excel("Task Data.xlsx")

In [4]:
df.sample(10)

Unnamed: 0,DocumentId,Locations,Headline,Abstract,First Part,Event Category,1st Level Sub Category,2nd Level Sub Category,3rd Level Sub Category
139,ConflictDocId105,"['Tripoli', 'Tunis', 'Libya', 'Misrata']",Tripoli airport closes again after rocket fire,TRIPOLI (Reuters) - The only functioning airpo...,Mitiga had only reopened on Jan. 14 after mont...,conflicts,armedConflicts,explosionsRemoteViolence,
0,AgreementsDocId1,"['Italy', 'Libya', 'Tripoli']",Al-Sarraj reviews with Eni officials support f...,The Head of the Libyan Presidential Council Fa...,The meeting was held in the presence of Eni CE...,agreements,governmentPrivateActorAgreements,Energy agreements / meetings / visits,
371,PoliticalDocId12,Libya,POLITICAL INSTABILITY AFFECTING CREDIT ACCESS ...,The monetary crisis currently entangling Libya...,One of the main factors that has contributed t...,politicalEvents,badGovernance,,
432,TerrorDocId2,Benghazi,Wide condemnation of assassination of female l...,There has been wide condemnation of yesterday'...,"The EU, UK and German Embassies also condemned...",terror,assassination,,
49,ConflictDocId20,"['Libya', 'Tripoli', 'Cairo', 'Turkey', 'Syria...",Libyan gov't abducts anticorruption official i...,CAIRO (AP) -- One of Libya's top anti-corrupti...,The audit bureau is an independent body appoin...,conflicts,armedConflicts,violenceAgainstCivilians,Abduction/forced disappearance
226,DiplomaticDocId64,"['Libya', 'Maio, Cape Verde', 'Egypt', 'Italy'...","Egypt, Italy FMs discuss achieving comprehensi...","According to an official statement, Egyptian f...",The two ministers also discussed the latest de...,diplomatic,diplomaticAgreements,,
205,DiplomaticDocId39,"['United Kingdom', 'Libya', 'Tripoli']","CBL governor, UK ambassador discuss return of ...",The Governor of the Central Bank of Libya (CBL...,The meeting also touched on ways to activate t...,diplomatic,diplomaticAgreements,Economic agreements / meetings / visits,
36,ConflictDocId9,"['United Arab Emirates', 'Libya']",Two Russian air defense systems destroyed in A...,Haftar's militias,The second Russian-made Pantsir anti-aircraft ...,conflicts,armedConflicts,explosionsRemoteViolence,Air/drone attack / air defence
468,TerrorDocId34,al-Ghani oil field near Zalla,IS Takes Foreign Hostages at Oilfield,Nine foreigners working for the Malta-based Au...,They were working at the al-Ghani oil field (p...,terror,hostageTaking,,
126,ConflictDocId92,"['Moscow', 'Turkey', 'Russia', 'Libya', 'Tripo...",Russia throws down the gauntlet in Libya with ...,Russia has upped the stakes in Libya after dep...,"""It is a big play for Russia, doing it in broa...",conflicts,nonViolentConflicts,militaryPreparations,


### Exploratory Data Analysis

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   DocumentId              463 non-null    object
 1   Locations               436 non-null    object
 2   Headline                462 non-null    object
 3   Abstract                458 non-null    object
 4   First Part              449 non-null    object
 5   Event Category          462 non-null    object
 6   1st Level Sub Category  462 non-null    object
 7   2nd Level Sub Category  249 non-null    object
 8   3rd Level Sub Category  103 non-null    object
dtypes: object(9)
memory usage: 34.5+ KB


The data set has 489 rows, many of them with missing values.

In [6]:
# Since DocumentId is unique for every example (document), we can drop this feature.
df.drop("DocumentId", axis=1, inplace=True)

In [7]:
#let's see how many documents have all null values
sum(df.isnull().all(axis=1))

27

27 documents have all null values

In [8]:
# Drop the documents with all null values
df.drop(df.index[df.isnull().all(axis=1)], inplace=True)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 0 to 488
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Locations               436 non-null    object
 1   Headline                462 non-null    object
 2   Abstract                458 non-null    object
 3   First Part              449 non-null    object
 4   Event Category          462 non-null    object
 5   1st Level Sub Category  462 non-null    object
 6   2nd Level Sub Category  249 non-null    object
 7   3rd Level Sub Category  103 non-null    object
dtypes: object(8)
memory usage: 32.5+ KB


We have 462 documents remaining

In [10]:
df.iloc[0]

Locations                                     ['Italy', 'Libya', 'Tripoli']
Headline                  Al-Sarraj reviews with Eni officials support f...
Abstract                  The Head of the Libyan Presidential Council Fa...
First Part                The meeting was held in the presence of Eni CE...
Event Category                                                   agreements
1st Level Sub Category                     governmentPrivateActorAgreements
2nd Level Sub Category                Energy agreements / meetings / visits
3rd Level Sub Category                                                  NaN
Name: 0, dtype: object

The Locations feature holds a list of locations.

Let's see how many items each Locations value contains.

In [11]:
def countitems(s):
    s=str(s)
    return len(s.strip('][').split(', '))

lcount=df.Locations.apply(lambda x: countitems(x))
lcount

0      3
1      2
2      3
3      4
4      4
      ..
483    7
484    3
486    9
487    3
488    2
Name: Locations, Length: 462, dtype: int64

In [12]:
def loclist(s):
    s=str(s)
    return s.strip('][').replace("'","").split(', ')

loc=df.Locations.apply(lambda x: loclist(x))
loc

0                                [Italy, Libya, Tripoli]
1                                         [Italy, Libya]
2                                [Libya, Tripoli, Sirte]
3                         [Libya, Sweden, France, Brega]
4                ["Ras Lanuf", Tripoli, Benghazi, Libya]
                             ...                        
483    [Sirte, Tripoli, Benghazi, Libya, Russia, Egyp...
484                           [Tripoli, Libya, Benghazi]
486    [Benghazi, Tripoli, Libya, Turkey, Russia, Egy...
487                                 [Chad, Sudan, Libya]
488                                     [Marj, Benghazi]
Name: Locations, Length: 462, dtype: object
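
The string-splitting helpers above treat a missing value as one location (`NaN` becomes the string `"nan"`) and split names that contain a comma, such as 'Maio, Cape Verde'. A minimal sketch of a stricter parser follows; `parse_locations` and the sample values are hypothetical and only illustrate the idea.

```python
import ast

import numpy as np
import pandas as pd


def parse_locations(value):
    """Return a list of location names from one raw Locations cell."""
    # Missing cells (NaN) carry no location at all.
    if pd.isna(value):
        return []
    text = str(value).strip()
    # Cells such as "['Italy', 'Libya']" are Python list literals.
    if text.startswith("[") and text.endswith("]"):
        try:
            return [str(item) for item in ast.literal_eval(text)]
        except (ValueError, SyntaxError):
            # Fall back to the plain split for malformed literals.
            return text.strip("][").replace("'", "").split(", ")
    # Anything else is treated as a single location name.
    return [text]


sample = pd.Series([
    "['Italy', 'Libya', 'Tripoli']",
    "['Libya', 'Maio, Cape Verde', 'Egypt']",
    "Benghazi",
    np.nan,
])
parsed = sample.apply(parse_locations)
counts = parsed.apply(len)
```

With this parser a missing value counts as zero locations, so the minimum and average location counts would differ from the figures reported in this notebook.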

In [13]:
sum(lcount==1)

124

In [14]:
df.loc[375]

Locations                                                               NaN
Headline                  Libya's Fragmentation: Structure and Process i...
Abstract                  After the overthrow of the Qadhafi regime in 2...
First Part                Rarely does internal division and political fr...
Event Category                                              politicalEvents
1st Level Sub Category                                        badGovernance
2nd Level Sub Category                                                  NaN
3rd Level Sub Category                                                  NaN
Name: 375, dtype: object

In [15]:
loc[87]

['Benghazi', 'Tripoli', 'Libya', 'Turkey', 'Russia']

In [17]:
print("Maximum location count:",max(lcount))
print("Minimum location count:",min(lcount))
print("Average location count:",np.mean(lcount))
print("Median of the location counts:",np.median(lcount))


Maximum location count: 20
Minimum location count: 1
Average location count: 3.7510822510822512
Median of the location counts: 3.0


Average location count is 3.75

Let's see how many words there are in Headline feature

In [18]:
hwcount=df.Headline.apply(lambda x: len(x.split()))

In [19]:
hwcount.value_counts()

10    83
8     79
9     79
11    51
7     42
12    38
6     30
13    18
15    11
14     7
4      5
5      3
16     3
17     3
18     3
20     2
3      2
19     1
22     1
23     1
Name: Headline, dtype: int64

Average word count in Headline feature is 9.65

In [20]:
def notnullwordcount(x):
    if type(x)==float:
        return 0
    else:
        return len(x.split())

In [21]:
awcount=df.Abstract.apply(lambda x: notnullwordcount(x))

In [22]:
awcount.mean()

59.27272727272727

Average word count in Abstract feature is 59.27

In [23]:
fpwcount=df["First Part"].apply(lambda x: notnullwordcount(x))

In [24]:
fpwcount.mean()

88.94155844155844

In [25]:
print("Maximum word counts:",max(hwcount),max(awcount),max(fpwcount))
print("Average word counts:",np.mean(hwcount),np.mean(awcount),np.mean(fpwcount))

Maximum word counts: 23 140 221
Average word counts: 9.651515151515152 59.27272727272727 88.94155844155844
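
The per-column word counts can also be computed in one pass with pandas string methods. The sketch below uses a small made-up frame, not the task data, and counts a missing text as 0 words, as `notnullwordcount` does.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing abstract (not the task data).
sample = pd.DataFrame({
    "Headline": ["Tripoli airport closes again",
                 "IS Takes Foreign Hostages at Oilfield"],
    "Abstract": ["The only functioning airport closed.", np.nan],
})

# .str.split() yields NaN for missing text, so fillna(0) counts it as 0 words.
word_counts = pd.DataFrame({
    col: sample[col].str.split().str.len().fillna(0).astype(int)
    for col in sample.columns
})
summary = word_counts.agg(["max", "mean"])
```

Counting missing texts as 0 lowers the mean. Averaging over non-missing texts only would give a higher figure.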


In [26]:
df["Event Category"].nunique()

10

In [27]:
df["Event Category"].value_counts(dropna=False)

conflicts                 133
diplomatic                 77
economicEvents             54
naturalDisasters           49
terror                     39
societalChallenges         36
politicalEvents            33
agreements                 25
uprising                   10
technologicalDisasters      6
Name: Event Category, dtype: int64

The "Event Category" and "1st Level Sub Category" features have no missing values.

In [28]:
df["Event Category"].value_counts(dropna=False).index

Index(['conflicts', 'diplomatic', 'economicEvents', 'naturalDisasters',
       'terror', 'societalChallenges', 'politicalEvents', 'agreements',
       'uprising', 'technologicalDisasters'],
      dtype='object')

### Classification for "Event Category"

#### Data preparation

We need to convert the data into a form that a machine can understand.

In [29]:
print("Headline:\n",df.iloc[0,1])
print("Abstract:\n",df.iloc[0,2])
print("First Part:\n",df.iloc[0,3])

Headline:
 Al-Sarraj reviews with Eni officials support for Libya's electricity
Abstract:
 The Head of the Libyan Presidential Council Fayez Al-Sarraj discussed Monday with Italian oil giant Eni officials possible investments of the Italian company in Libya in development projects in areas where the company operates and support the electricity sector. The discussion came in a meeting in Tripoli where Eni officials and Al-Sarraj reviewed the work of the Italian company.
First Part:
 The meeting was held in the presence of Eni CEO Claudio Descalzi and other officials from the Italian company, in addition to the Chairman of the National Oil Corporation Mustafa Sanallah.


We have three text features for classifying the data. We can concatenate them to give the model more text per document.

In [31]:
df["texts"]=df['Headline'] + " " + df['Abstract'].fillna("") + " " + df['First Part'].fillna("")

In this approach we relate word counts to the categories. We use the NLTK library for text cleaning.

In [32]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mehme\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mehme\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

We should follow these steps for data cleaning:
<br> 1. We should make all letters lower case. We don't want to count book and Book separately.
<br> 2. We should get rid of the punctuation.
<br> 3. We should get rid of the stop words. Stop words are words that are used a lot but carry little meaning on their own.
<br> 4. We should count all the words in their root form. We don't want to count book and books separately.

In [33]:
stop_words = stopwords.words('english')

In [34]:
def cleantext(text):
    import re

    # Convert to lower case
    text = text.lower()

    # Remove punctuation
    text_without_punc = re.sub(r'[^\w\s]', '', text)

    # Remove digits
    text_without_num = re.sub(r'[0-9]', '', text_without_punc)

    # Remove stop words
    text_without_sw = [t for t in text_without_num.split() if t not in stop_words]

    # Lemmatize each token, reusing one lemmatizer instead of creating one per token
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(t) for t in text_without_sw]

    # Join the tokens back into a single string
    return " ".join(lemmatized)

In [35]:
df["cleaned_text"]=df["texts"].apply(cleantext)

In [36]:
df["cleaned_text"]

0      alsarraj review eni official support libya ele...
1      noc eni review resuming stalled project chairm...
2      halliburton discus increased cooperation noc i...
3      libya say total mull investment nation oil fie...
4      zallaf noc discus new south refinery wide rang...
                             ...                        
483    protester dispersed gunfire libya capital trip...
484    libyan protestors torch eastern government off...
486    protest flare libya benghazi power cut living ...
487    protest hun haftar mercenary kill local citize...
488    eastern libya demonstrator shot burn state bui...
Name: cleaned_text, Length: 462, dtype: object
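
Tokens such as 'abortedhowever' and 'abroadthe' show up later in the TFIDF vocabulary. A likely cause, though not confirmed against the raw data, is that punctuation is deleted outright, so two words separated only by a full stop get fused. The sketch below compares deletion with replacement by a space on a made-up sentence.

```python
import re

# Made-up sentence with missing spaces after the full stops.
text = "The attack was aborted.However, flights abroad.The airport stayed closed."

# Deleting punctuation fuses words that had no space between them.
deleted = re.sub(r"[^\w\s]", "", text.lower()).split()

# Replacing punctuation with a space keeps the neighbouring words apart.
spaced = re.sub(r"[^\w\s]", " ", text.lower()).split()
```

The trade-off is that replacing with a space also splits hyphenated names such as "Al-Sarraj" into two tokens, whereas deletion keeps them together as "alsarraj".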

#### Vectorization

After cleaning the text we should convert the words to numbers, which is called vectorization. Two common vectorization methods are:
<br> *Bag of Words (BoW):* We count each word for each document.
<br> *TFIDF Vectorizer* (Term Frequency-Inverse Document Frequency): We weight each word's count in a document by how rare the word is across the whole corpus.
<br> We use TFIDF for this project.

Since TFIDF uses the whole corpus, we should split the data first to prevent data leakage.
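
To illustrate the leakage point on a toy corpus (the documents below are made up): the vectorizer learns its vocabulary and IDF weights from the training texts only, and the test texts are transformed with what was learned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["oil field reopened", "militia attack oil field", "minister visit tripoli"]
test_docs = ["militia attack airport"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training texts.
X_train_demo = vectorizer.fit_transform(train_docs)
# transform reuses them; nothing is learned from the test texts.
X_test_demo = vectorizer.transform(test_docs)

vocabulary = set(vectorizer.vocabulary_)
```

The word "airport" occurs only in the test document, so it is not in the vocabulary and is ignored when the test document is transformed.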

In [37]:
X = df["cleaned_text"]
y = df["Event Category"]

In [38]:
# Some categories (e.g. technologicalDisasters, uprising) are very rare, so we use stratify=y to make sure the test data contains them
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=101)

In [39]:
y_test

146           conflicts
6            agreements
280    naturalDisasters
296    naturalDisasters
241          diplomatic
             ...       
392     politicalEvents
440              terror
112           conflicts
368     politicalEvents
115           conflicts
Name: Event Category, Length: 139, dtype: object
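
The effect of `stratify` can be checked on labels alone. The sketch below rebuilds a label array from the category counts reported earlier and confirms that every category reaches the test split; the index array stands in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Category sizes copied from the value_counts() output above.
class_sizes = {
    "conflicts": 133, "diplomatic": 77, "economicEvents": 54,
    "naturalDisasters": 49, "terror": 39, "societalChallenges": 36,
    "politicalEvents": 33, "agreements": 25, "uprising": 10,
    "technologicalDisasters": 6,
}
labels = np.repeat(list(class_sizes), list(class_sizes.values()))
indices = np.arange(len(labels))  # stand-in for the real feature matrix

_, _, _, y_test_demo = train_test_split(
    indices, labels, test_size=0.3, stratify=labels, random_state=101
)
names, freqs = np.unique(y_test_demo, return_counts=True)
test_counts = {str(name): int(freq) for name, freq in zip(names, freqs)}
```

Without `stratify`, a purely random split could by chance leave a six-document category with no test examples at all.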

In [40]:
tf_idf_vectorizer = TfidfVectorizer()
X_train_tf_idf = tf_idf_vectorizer.fit_transform(X_train)
X_test_tf_idf = tf_idf_vectorizer.transform(X_test)

In [42]:
# Unique tokens:
tf_idf_vectorizer.get_feature_names()

['abandoned',
 'abc',
 'abdalla',
 'abdel',
 'abdelfatah',
 'abducted',
 'abduction',
 'abdul',
 'abdullah',
 'abdulrasoul',
 'abdulsalam',
 'abide',
 'ability',
 'ablaze',
 'able',
 'aboard',
 'abortedhowever',
 'abortive',
 'abovementioned',
 'abroad',
 'abroadthe',
 'absence',
 'absent',
 'absolute',
 'abu',
 'abuse',
 'abuser',
 'academy',
 'accelerated',
 'accelerating',
 'accept',
 'acceptable',
 'accepted',
 'accepting',
 'accepts',
 'access',
 'accessory',
 'accompanied',
 'accompanying',
 'accomplish',
 'accord',
 'accordance',
 'according',
 'accordviolence',
 'account',
 'accountability',
 'accountable',
 'accountableaccording',
 'accounted',
 'accrued',
 'accumulating',
 'accumulation',
 'accuracy',
 'accurate',
 'accusation',
 'accuse',
 'accused',
 'accuses',
 'accusing',
 'achieve',
 'achieved',
 'achievement',
 'achieving',
 'acid',
 'acquiring',
 'acquisition',
 'across',
 'act',
 'acting',
 'action',
 'activate',
 'active',
 'activist',
 'activity',
 'actor',
 'actual

In [44]:
# TFIDF Vectors
df_train_tfidf = pd.DataFrame(X_train_tf_idf.toarray(), columns = tf_idf_vectorizer.get_feature_names(),
                              index= X_train.index)
df_train_tfidf

Unnamed: 0,abandoned,abc,abdalla,abdel,abdelfatah,abducted,abduction,abdul,abdullah,abdulrasoul,...,zawiya,zayed,zela,zero,zintan,ziyad,zliten,zone,zueitina,zvezda
347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.115598,0.0,0.0,0.0,0.0,0.000000,0.0
252,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.377404,0.0
316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
360,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.156773,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
481,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0


In [159]:
# Save the fitted vectorizer so the same vocabulary and IDF weights can be reused later
with open('vectorizer.pickle', 'wb') as fout:
  pickle.dump(tf_idf_vectorizer, fout)
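
A pickled TFIDF vectorizer can be reloaded later to transform unseen text with the same vocabulary. The round trip below is a minimal sketch on a toy corpus; the temporary file name is arbitrary.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer().fit(["oil field reopened", "militia attack oil field"])

# Write the fitted vectorizer to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "vectorizer_demo.pickle")
with open(path, "wb") as fout:
    pickle.dump(vectorizer, fout)

# Load it back and transform a new document with the stored vocabulary.
with open(path, "rb") as fin:
    restored = pickle.load(fin)

new_vector = restored.transform(["attack near oil field"])
```

Pickles of scikit-learn objects are generally only safe to load with the same library version that wrote them.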

#### Training the Models

The data set is small and heavily imbalanced. F1 score is usually the better metric for imbalanced data, but some categories have very few samples (technologicalDisasters has only 2 documents in the test set), so precision and recall will most probably be 0 for those categories. We therefore use accuracy for demonstration purposes.
<br>First we build an evaluation function to standardize the results.
<br> We use GridSearchCV with 10-fold cross-validation for hyperparameter tuning.

In [45]:
def eval(model, X_train, X_test):
    # Note: this name shadows Python's built-in eval(), and y_train / y_test
    # are read from the notebook's global scope.
    y_pred = model.predict(X_test)
    y_pred_train = model.predict(X_train)
    print(confusion_matrix(y_test, y_pred))
    print("Test_Set")
    print(classification_report(y_test,y_pred))
    print("Train_Set")
    print(classification_report(y_train,y_pred_train))
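
To make the metric choice concrete, the toy example below (made-up labels) compares accuracy with macro-averaged F1 for a model that always predicts the majority class. `zero_division=0` sets the undefined precision of the never-predicted class to 0 without raising a warning.

```python
from sklearn.metrics import accuracy_score, f1_score

# Eight majority-class documents and two rare-class documents.
y_true = ["conflicts"] * 8 + ["uprising"] * 2
# A model that always predicts the majority class.
y_pred = ["conflicts"] * 10

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```

Accuracy is 0.8 even though the rare class is never found, while macro F1 is about 0.44 because the rare class contributes an F1 of 0.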

##### Naive Bayes

In [47]:
from sklearn.naive_bayes import MultinomialNB

In [52]:
# Cross Validation takes long time
nb = MultinomialNB()
param={'alpha': [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100,1000]}
clf=GridSearchCV(nb,param,scoring='accuracy',cv=10,return_train_score=True)
clf.fit(X_train_tf_idf,y_train)



In [53]:
clf.best_params_

{'alpha': 0.0001}
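
Besides `best_params_`, the fitted search object keeps per-candidate scores in `cv_results_`. Comparing mean train and validation scores there can hint at overfitting. The sketch below runs on synthetic count data, not the task data, and uses 5 folds to stay fast.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Synthetic non-negative "word count" features for two classes.
X_demo = rng.poisson(lam=1.0, size=(60, 20)).astype(float)
y_demo = np.array([0, 1] * 30)
X_demo[y_demo == 1, :5] += 2  # make the first five features informative

search = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.0001, 0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# One row per alpha value, best validation score first.
results = pd.DataFrame(search.cv_results_)[
    ["param_alpha", "mean_train_score", "mean_test_score", "std_test_score"]
].sort_values("mean_test_score", ascending=False)
```

A large gap between `mean_train_score` and `mean_test_score` for the selected candidate suggests the model is memorizing the training folds.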

In [48]:
nb = MultinomialNB(alpha=0.0001)
nb.fit(X_train_tf_idf,y_train)

MultinomialNB(alpha=0.0001)

In [49]:
print("NAIVE BAYES MODEL")
eval(nb, X_train_tf_idf, X_test_tf_idf)

NAIVE BAYES MODEL
[[ 3  0  3  1  0  0  0  0  0  0]
 [ 0 37  2  1  0  0  0  0  0  0]
 [ 0  5 16  0  1  0  1  0  0  0]
 [ 1  0  2 12  0  0  1  0  0  0]
 [ 0  1  0  0 10  0  4  0  0  0]
 [ 0  3  2  0  0  5  0  0  0  0]
 [ 1  2  0  1  3  0  3  1  0  0]
 [ 0  2  0  0  0  0  0  0  0  0]
 [ 0  1  0  0  0  0  0  0 11  0]
 [ 0  2  0  0  0  0  0  0  0  1]]
Test_Set
                        precision    recall  f1-score   support

            agreements       0.60      0.43      0.50         7
             conflicts       0.70      0.93      0.80        40
            diplomatic       0.64      0.70      0.67        23
        economicEvents       0.80      0.75      0.77        16
      naturalDisasters       0.71      0.67      0.69        15
       politicalEvents       1.00      0.50      0.67        10
    societalChallenges       0.33      0.27      0.30        11
technologicalDisasters       0.00      0.00      0.00         2
                terror       1.00      0.92      0.96        12
 

##### Logistic Regression

In [51]:
from sklearn.linear_model import LogisticRegression

In [62]:
# Cross validation takes long time
log = LogisticRegression(multi_class="multinomial",solver="lbfgs",penalty='l2', max_iter=1000)
param={'C': [0.0001, 0.001, 0.01, 0.1, 1.0]}
clf=GridSearchCV(log,param,scoring='accuracy',cv=10,return_train_score=True)
clf.fit(X_train_tf_idf,y_train)




In [63]:
clf.best_params_

{'C': 1.0}

In [52]:
log = LogisticRegression(C=1, multi_class="multinomial",solver="lbfgs", penalty='l2')
log.fit(X_train_tf_idf,y_train)

LogisticRegression(C=1, multi_class='multinomial')

In [53]:
print("LOGISTIC REGRESSION MODEL")
eval(log, X_train_tf_idf, X_test_tf_idf)

LOGISTIC REGRESSION MODEL
[[ 0  1  3  3  0  0  0  0  0  0]
 [ 0 38  0  2  0  0  0  0  0  0]
 [ 0  8 14  0  1  0  0  0  0  0]
 [ 0  3  1 12  0  0  0  0  0  0]
 [ 0  2  0  0 11  0  2  0  0  0]
 [ 0  3  5  0  0  2  0  0  0  0]
 [ 0  6  4  0  1  0  0  0  0  0]
 [ 0  2  0  0  0  0  0  0  0  0]
 [ 0  2  0  0  0  0  0  0 10  0]
 [ 0  3  0  0  0  0  0  0  0  0]]
Test_Set
                        precision    recall  f1-score   support

            agreements       0.00      0.00      0.00         7
             conflicts       0.56      0.95      0.70        40
            diplomatic       0.52      0.61      0.56        23
        economicEvents       0.71      0.75      0.73        16
      naturalDisasters       0.85      0.73      0.79        15
       politicalEvents       1.00      0.20      0.33        10
    societalChallenges       0.00      0.00      0.00        11
technologicalDisasters       0.00      0.00      0.00         2
                terror       1.00      0.83      0.91    



##### Support Vector Machines (SVM)

In [54]:
from sklearn.svm import SVC

In [84]:
#Cross Validation takes long time

svc = SVC(max_iter=10000)
param= {'C': [0.001, 0.05, 0.01, 0.1],
              'gamma': ["scale", "auto", 0.2, 0.3, 0.5],
              'kernel': ['rbf', 'linear','poly'],
              'class_weight': ["balanced", None],
              'degree':[2,3,4,5,6]}
clf=GridSearchCV(svc,param,scoring='accuracy',cv=10,return_train_score=True)
clf.fit(X_train_tf_idf,y_train)



In [85]:
clf.best_params_

{'C': 0.001,
 'class_weight': None,
 'degree': 2,
 'gamma': 'scale',
 'kernel': 'rbf'}

In [55]:
svc = SVC(C=0.001, class_weight= None, degree=2, gamma='scale', kernel='rbf')
svc.fit(X_train_tf_idf,y_train)

SVC(C=0.001, degree=2)

In [56]:
print("SUPPORT VECTOR MACHINES MODEL")
eval(svc, X_train_tf_idf, X_test_tf_idf)

SUPPORT VECTOR MACHINES MODEL
[[ 0  7  0  0  0  0  0  0  0  0]
 [ 0 40  0  0  0  0  0  0  0  0]
 [ 0 23  0  0  0  0  0  0  0  0]
 [ 0 16  0  0  0  0  0  0  0  0]
 [ 0 15  0  0  0  0  0  0  0  0]
 [ 0 10  0  0  0  0  0  0  0  0]
 [ 0 11  0  0  0  0  0  0  0  0]
 [ 0  2  0  0  0  0  0  0  0  0]
 [ 0 12  0  0  0  0  0  0  0  0]
 [ 0  3  0  0  0  0  0  0  0  0]]
Test_Set
                        precision    recall  f1-score   support

            agreements       0.00      0.00      0.00         7
             conflicts       0.29      1.00      0.45        40
            diplomatic       0.00      0.00      0.00        23
        economicEvents       0.00      0.00      0.00        16
      naturalDisasters       0.00      0.00      0.00        15
       politicalEvents       0.00      0.00      0.00        10
    societalChallenges       0.00      0.00      0.00        11
technologicalDisasters       0.00      0.00      0.00         2
                terror       0.00      0.00      0.00
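
The confusion matrix above shows the SVM assigning every test document to conflicts, so it performs exactly like a majority-class baseline. The sketch below reproduces that baseline accuracy from the row sums of the matrix. The dummy model is fitted on the test labels only to reproduce the number; in practice it would be fitted on the training labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test-set class sizes: the row sums of the confusion matrix above.
support = {
    "agreements": 7, "conflicts": 40, "diplomatic": 23, "economicEvents": 16,
    "naturalDisasters": 15, "politicalEvents": 10, "societalChallenges": 11,
    "technologicalDisasters": 2, "terror": 12, "uprising": 3,
}
y_true = np.repeat(list(support), list(support.values()))

# The most_frequent strategy ignores the features, so zeros are enough.
X_dummy = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_accuracy = accuracy_score(y_true, baseline.predict(X_dummy))
```

The baseline reaches 40/139, roughly 0.29, which is the floor any useful model on this split has to beat.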



##### K-Nearest Neighbor (KNN)

In [60]:
from sklearn.neighbors import KNeighborsClassifier

In [113]:
# Cross Validation takes long time
knn = KNeighborsClassifier()
param={'n_neighbors': [1, 3, 5, 7, 9, 11],
       'weights' : ['uniform', 'distance'],
       'metric' : ['euclidean', 'manhattan', 'minkowski']}
clf=GridSearchCV(knn,param,scoring='accuracy',cv=10,return_train_score=True)
clf.fit(X_train_tf_idf,y_train)



In [114]:
clf.best_params_

{'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'distance'}

In [61]:
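# Note: the grid search above selected n_neighbors=3, but the results below were produced with n_neighbors=5.
knn = KNeighborsClassifier(n_neighbors=5,metric= 'euclidean',weights='distance')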
knn.fit(X_train_tf_idf,y_train)

KNeighborsClassifier(metric='euclidean', weights='distance')

In [62]:
print("KNN MODEL")
eval(knn, X_train_tf_idf, X_test_tf_idf)

KNN MODEL
[[ 3  1  2  1  0  0  0  0  0  0]
 [ 0 39  0  1  0  0  0  0  0  0]
 [ 0  2 19  0  1  0  0  0  1  0]
 [ 1  0  1 13  1  0  0  0  0  0]
 [ 0  0  1  0 12  0  2  0  0  0]
 [ 0  1  0  0  0  8  0  0  0  1]
 [ 0  0  3  1  2  0  3  1  1  0]
 [ 0  0  0  0  0  0  0  0  1  1]
 [ 0  0  0  0  0  0  0  0 12  0]
 [ 0  1  0  0  0  0  0  0  0  2]]
Test_Set
                        precision    recall  f1-score   support

            agreements       0.75      0.43      0.55         7
             conflicts       0.89      0.97      0.93        40
            diplomatic       0.73      0.83      0.78        23
        economicEvents       0.81      0.81      0.81        16
      naturalDisasters       0.75      0.80      0.77        15
       politicalEvents       1.00      0.80      0.89        10
    societalChallenges       0.60      0.27      0.37        11
technologicalDisasters       0.00      0.00      0.00         2
                terror       0.80      1.00      0.89        12
         

##### Random Forest

In [63]:
from sklearn.ensemble import RandomForestClassifier

In [117]:
#Cross validation takes long time
rf = RandomForestClassifier(random_state = 42, n_jobs = -1)
param={'n_estimators': [25, 50, 100, 150, 200, 250],
       'max_features': ['sqrt', 'log2', None],
       'max_depth': [3, 6, 9, 12, 15],
       'max_leaf_nodes': [3, 6, 9, 12, 15]
           }
clf=GridSearchCV(rf,param,scoring='accuracy',cv=10,return_train_score=True)
clf.fit(X_train_tf_idf,y_train)



In [118]:
clf.best_params_

{'max_depth': 12,
 'max_features': None,
 'max_leaf_nodes': 15,
 'n_estimators': 250}

In [64]:
rf = RandomForestClassifier(250, max_depth=12,max_features=None, max_leaf_nodes= 15, random_state = 42, n_jobs = -1)
rf.fit(X_train_tf_idf, y_train)

RandomForestClassifier(max_depth=12, max_features=None, max_leaf_nodes=15,
                       n_estimators=250, n_jobs=-1, random_state=42)

In [65]:
print("RF MODEL")
eval(rf, X_train_tf_idf, X_test_tf_idf)

RF MODEL
[[ 2  0  2  2  0  0  1  0  0  0]
 [ 0 37  2  1  0  0  0  0  0  0]
 [ 0 10 12  0  1  0  0  0  0  0]
 [ 0  1  2 12  1  0  0  0  0  0]
 [ 0  1  0  0 13  0  1  0  0  0]
 [ 0  3  1  0  1  5  0  0  0  0]
 [ 0  4  5  1  1  0  0  0  0  0]
 [ 0  2  0  0  0  0  0  0  0  0]
 [ 0  0  0  1  0  0  0  0 11  0]
 [ 0  3  0  0  0  0  0  0  0  0]]
Test_Set
                        precision    recall  f1-score   support

            agreements       1.00      0.29      0.44         7
             conflicts       0.61      0.93      0.73        40
            diplomatic       0.50      0.52      0.51        23
        economicEvents       0.71      0.75      0.73        16
      naturalDisasters       0.76      0.87      0.81        15
       politicalEvents       1.00      0.50      0.67        10
    societalChallenges       0.00      0.00      0.00        11
technologicalDisasters       0.00      0.00      0.00         2
                terror       1.00      0.92      0.96        12
          



##### Adaboost

In [66]:
from sklearn.ensemble import AdaBoostClassifier

In [125]:
#Cross validation takes long time
ada = AdaBoostClassifier(random_state = 42)
param={'n_estimators': [25, 50, 100, 150, 200, 250],
       'algorithm': ['SAMME', 'SAMME.R']
       }
clf=GridSearchCV(ada,param,scoring='accuracy',cv=10,return_train_score=True)
clf.fit(X_train_tf_idf,y_train)




In [126]:
clf.best_params_

{'algorithm': 'SAMME', 'n_estimators': 250}

In [67]:
ada = AdaBoostClassifier(n_estimators= 250, algorithm= 'SAMME', random_state = 42)
ada.fit(X_train_tf_idf, y_train)

AdaBoostClassifier(algorithm='SAMME', n_estimators=250, random_state=42)

In [68]:
print("Ada MODEL")
eval(ada, X_train_tf_idf, X_test_tf_idf)

Ada MODEL
[[ 3  0  3  1  0  0  0  0  0  0]
 [ 0 37  2  0  0  0  1  0  0  0]
 [ 0 15  6  1  0  0  1  0  0  0]
 [ 0 10  2  3  0  0  1  0  0  0]
 [ 0 10  0  0  5  0  0  0  0  0]
 [ 0  2  4  1  0  3  0  0  0  0]
 [ 0  6  4  1  0  0  0  0  0  0]
 [ 0  2  0  0  0  0  0  0  0  0]
 [ 0  7  3  1  0  0  0  0  1  0]
 [ 0  1  2  0  0  0  0  0  0  0]]
Test_Set
                        precision    recall  f1-score   support

            agreements       1.00      0.43      0.60         7
             conflicts       0.41      0.93      0.57        40
            diplomatic       0.23      0.26      0.24        23
        economicEvents       0.38      0.19      0.25        16
      naturalDisasters       1.00      0.33      0.50        15
       politicalEvents       1.00      0.30      0.46        10
    societalChallenges       0.00      0.00      0.00        11
technologicalDisasters       0.00      0.00      0.00         2
                terror       1.00      0.08      0.15        12
         



##### Deep Learning

A GRU (gated recurrent unit) network is used in this section. A GRU is a deep learning model for sequential data such as time series or text. Since TFIDF does not keep the word order, we need to tokenize the text in another way.

In [69]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, Embedding, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from sklearn.preprocessing import LabelEncoder

In [70]:
num_words = 1200
tokenizer = Tokenizer(num_words=num_words)

In [71]:
X.values

array(['alsarraj review eni official support libya electricity head libyan presidential council fayez alsarraj discussed monday italian oil giant eni official possible investment italian company libya development project area company operates support electricity sector discussion came meeting tripoli eni official alsarraj reviewed work italian company meeting held presence eni ceo claudio descalzi official italian company addition chairman national oil corporation mustafa sanallah',
       'noc eni review resuming stalled project chairman national oil corporation noc mustafa sanallah discussed ceo italian company eni resumption significant project stalled funding shortfall meeting held noc hq dealt way maintain production rate onshore offshore field increase capacity well supporting libyan energy sector level two party also discussed progress made offshore project e increase natural rate gas production bahr alsalam coming year secure local market supply gas according statement nocfor p

In [72]:
tokenizer.fit_on_texts(X.values)

In [73]:
tokenizer.word_index

{'libya': 1,
 'libyan': 2,
 'said': 3,
 'tripoli': 4,
 'government': 5,
 'oil': 6,
 'force': 7,
 'national': 8,
 'country': 9,
 'military': 10,
 'haftar': 11,
 'army': 12,
 'haftars': 13,
 'two': 14,
 'attack': 15,
 'khalifa': 16,
 'militia': 17,
 'also': 18,
 'new': 19,
 'case': 20,
 'gna': 21,
 'support': 22,
 'eastern': 23,
 'security': 24,
 'political': 25,
 'united': 26,
 'since': 27,
 'noc': 28,
 'ministry': 29,
 'state': 30,
 'statement': 31,
 'operation': 32,
 'company': 33,
 'meeting': 34,
 'foreign': 35,
 'u': 36,
 'capital': 37,
 'accord': 38,
 'control': 39,
 'one': 40,
 'coronavirus': 41,
 'day': 42,
 'people': 43,
 'turkey': 44,
 'power': 45,
 'air': 46,
 'city': 47,
 'ld': 48,
 'un': 49,
 'minister': 50,
 'reuters': 51,
 'year': 52,
 'total': 53,
 'last': 54,
 'lna': 55,
 'according': 56,
 'reported': 57,
 'nation': 58,
 'month': 59,
 'group': 60,
 'center': 61,
 'official': 62,
 'area': 63,
 'project': 64,
 'field': 65,
 'international': 66,
 'turkish': 67,
 'russia': 6

In [74]:
len(tokenizer.word_index)

6136

In [75]:
X_num_tokens = tokenizer.texts_to_sequences(X)


In [76]:
num_tokens = [len(tokens) for tokens in X_num_tokens]
num_tokens = np.array(num_tokens)

In [77]:
num_tokens.mean()

74.42857142857143

In [78]:
num_tokens.max()

142

In [79]:
max_tokens = 112

In [80]:
X_pad = pad_sequences(X_num_tokens, maxlen = max_tokens)

In [81]:
X_pad.shape

(462, 112)
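
With its default settings, `pad_sequences` pads and truncates at the front of each sequence, so a text longer than `max_tokens` loses its earliest tokens. The pure-Python sketch below mimics that default behavior for illustration only; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic the pad_sequences defaults: 'pre' padding and 'pre' truncation."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # 'pre' truncation keeps only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # 'pre' padding adds zeros in front
    return padded

# One short sequence (gets padded) and one long sequence (gets truncated)
example = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(example)
```

Since the longest text here has 142 tokens and `max_tokens` is 112, some texts are cut at the start; passing `truncating='post'` to `pad_sequences` may be worth trying if the beginning of the text matters more.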

In [82]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(df["Event Category"])

In [83]:
X_train, X_test, y_train, y_test = train_test_split(X_pad, y_encoded, test_size=0.3, stratify=y, random_state=101)

In [84]:
model = Sequential()
num_classes=len(df["Event Category"].unique())
embedding_size = 500

model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens))
model.add(Dropout(0.2))

model.add(GRU(units=250, return_sequences=True))
model.add(Dropout(0.2))

model.add(GRU(units=100, return_sequences=True))
model.add(Dropout(0.2))

model.add(GRU(units=50, return_sequences=True))
model.add(Dropout(0.2))

model.add(GRU(units=25))
model.add(Dropout(0.2))

model.add(Dense(num_classes, activation='softmax'))




In [85]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [86]:
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x23eacbab340>

In [87]:
test_loss, test_acc = model.evaluate(X_test, y_test)



In [88]:
predictions = model.predict(X_test)



In [89]:
predicted_labels = label_encoder.inverse_transform(tf.argmax(predictions, axis=1).numpy())

In [90]:
confusion_matrix(label_encoder.inverse_transform(y_test), predicted_labels)


array([[ 2,  0,  4,  1,  0,  0,  0,  0,  0,  0],
       [ 0, 29,  2,  1,  1,  5,  2,  0,  0,  0],
       [ 3,  4,  8,  1,  2,  2,  2,  0,  1,  0],
       [ 3,  1,  0,  9,  1,  0,  0,  0,  2,  0],
       [ 0,  0,  0,  0,  8,  1,  6,  0,  0,  0],
       [ 2,  0,  1,  0,  0,  3,  0,  0,  4,  0],
       [ 0,  1,  1,  1,  4,  1,  3,  0,  0,  0],
       [ 0,  2,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 2,  1,  0,  0,  0,  1,  1,  0,  7,  0],
       [ 0,  1,  0,  0,  0,  1,  0,  0,  1,  0]], dtype=int64)

In [91]:
test_acc

0.49640288949012756

KNN gave the best results, so we save the KNN model.

In [157]:
with open('KNNModel.pkl', 'wb') as fout:
  pickle.dump(knn, fout)

### Classification for Sub Categories

Since it would be hard to train a model for every sub category, we use zero-shot learning for sub category classification.

In [99]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to roberta-large-mnli and revision 130fb28 (https://huggingface.co/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [100]:
#KNN
def predictEventType(newdata):
  text=newdata['Headline'] + " " + newdata['Abstract'] + " " + newdata['First Part']
  text=cleantext(text)
  vector=tf_idf_vectorizer.transform([text]).toarray()
  return knn.predict(vector)

In [149]:
def pickLabel(text, labels):
  # labels: unique, non-NaN candidate sub categories for the current level
  if len(labels)>1:
    result=classifier(text, candidate_labels=list(labels))
    return result["labels"][np.argmax(result["scores"])]
  if len(labels)==1:
    return labels[0]  # a single candidate needs no classification
  return np.nan       # this branch of the hierarchy ends here

def predictSubCategories(newdata):
  event=predictEventType(newdata)[0]
  text=newdata['Headline'] + " " + newdata['Abstract'] + " " + newdata['First Part']

  # Narrow the rows level by level; dropna() keeps NaN out of the candidate labels
  rows=df[df['Event Category']==event]
  fl=pickLabel(text, rows['1st Level Sub Category'].dropna().unique())
  if pd.isna(fl):
    return event, np.nan, np.nan, np.nan

  rows=rows[rows['1st Level Sub Category']==fl]
  sl=pickLabel(text, rows['2nd Level Sub Category'].dropna().unique())
  if pd.isna(sl):
    return event, fl, np.nan, np.nan

  rows=rows[rows['2nd Level Sub Category']==sl]
  tl=pickLabel(text, rows['3rd Level Sub Category'].dropna().unique())
  return event, fl, sl, tl


In [150]:
newdata=df.iloc[350]
event, fl, sl, tl=predictSubCategories(newdata)
print("event:",event)
print("1st Level Sub Category:",fl)
print("2nd Level Sub Category:",sl)
print("3rd Level Sub Category:",tl)



event: politicalEvents
1st Level Sub Category: governmentChange
2nd Level Sub Category: nan
3rd Level Sub Category: nan


**Check It Yourself**

In [153]:
#Please write the "Headline", "Abstract" and "First Part" of the event inside the quotation marks

headline="Headline"
abstract="Abstract"
firstPart="First Part"

newdata=pd.DataFrame({"Headline":[headline],"Abstract":[abstract],"First Part":[firstPart]}).iloc[0]
event, fl, sl, tl=predictSubCategories(newdata)
print("event:",event)
print("1st Level Sub Category:",fl)
print("2nd Level Sub Category:",sl)
print("3rd Level Sub Category:",tl)

event: naturalDisasters
1st Level Sub Category: meteorological
2nd Level Sub Category: storm
3rd Level Sub Category: convectiveStorm


## Challenges

### 1)	Select one event type from the event category and develop your text classification model. Please keep in mind that, if you select the event, your model should predict the relevant sub-level categories. For example, event category: Conflict, 1st Level Sub Category: Armed Conflicts, 2nd Level Sub Category: Explosions/Remote Violence, 3rd Level Sub Category: Air/drone attack / air defense. We expect you to create a hierarchical text classification model. You can create one model or multiple models.

Hierarchical text classification assigns documents to a hierarchical structure of categories, organized like a tree. Each class has exactly one parent class but may have several subclasses. Two types of training process may be applied in hierarchical text classification:
<br>*Progressive Training:* The model is trained first for the high level categories, then for the low level categories.
<br>*Joint Learning:* The model is trained for the high and low level categories at once.
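
Top-down prediction over such a tree can be sketched as a walk from the root: pick a class at one level, then choose only among its children. The taxonomy fragment below is taken from the sample rows shown earlier; `KEYWORDS` and `score` are hypothetical, a toy keyword-overlap stand-in for a trained classifier.

```python
# Fragment of the category tree; leaves map to empty dicts.
TAXONOMY = {
    "conflicts": {
        "armedConflicts": {
            "explosionsRemoteViolence": {},
            "violenceAgainstCivilians": {"Abduction/forced disappearance": {}},
        }
    },
    "terror": {"assassination": {}},
}

# Hypothetical keywords per label (a trained model would replace this).
KEYWORDS = {
    "conflicts": {"rocket", "abducts", "militia"},
    "terror": {"assassination", "bomb"},
    "armedConflicts": {"rocket", "abducts"},
    "explosionsRemoteViolence": {"rocket", "shelling"},
    "violenceAgainstCivilians": {"abducts", "civilians"},
    "assassination": {"assassination"},
    "Abduction/forced disappearance": {"abducts"},
}

def score(text, label):
    """Toy score: number of the label's keywords present in the text."""
    return len(KEYWORDS.get(label, set()) & set(text.lower().split()))

def predict_path(text, tree):
    """Descend the tree, choosing the best-scoring child at each level."""
    path = []
    while tree:
        best = max(tree, key=lambda label: score(text, label))
        path.append(best)
        tree = tree[best]  # continue only among the chosen class's children
    return path

path = predict_path("Tripoli airport closes again after rocket fire", TAXONOMY)
print(path)
```

Restricting each step to the children of the previous choice keeps the candidate set small, which matters when some sub categories have only a handful of examples.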

### 2)	We expect to see your approach on both ML and DL. Present your model scores through evaluation metrics and ideal hyperparameters.

I tried 6 machine learning models and 1 deep learning model. I applied cross validation to the machine learning models and manually tried a few hyperparameter settings for the deep learning model, since deep learning takes longer to run. I used accuracy as the evaluation metric.
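
Accuracy alone can hide weak classes on imbalanced data. As a worked check, the sketch below recomputes accuracy and per-class recall from the GRU confusion matrix reported above, assuming rows are true classes and columns are predictions; the variable names are mine.

```python
# GRU confusion matrix copied from the notebook output
cm = [
    [2, 0, 4, 1, 0, 0, 0, 0, 0, 0],
    [0, 29, 2, 1, 1, 5, 2, 0, 0, 0],
    [3, 4, 8, 1, 2, 2, 2, 0, 1, 0],
    [3, 1, 0, 9, 1, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 8, 1, 6, 0, 0, 0],
    [2, 0, 1, 0, 0, 3, 0, 0, 4, 0],
    [0, 1, 1, 1, 4, 1, 3, 0, 0, 0],
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [2, 1, 0, 0, 0, 1, 1, 0, 7, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
total = sum(sum(row) for row in cm)              # all test samples
accuracy = correct / total

# Recall per class: correct predictions divided by the number of true samples
recall = [row[i] / sum(row) for i, row in enumerate(cm)]
zero_recall = [i for i, r in enumerate(recall) if r == 0]

print(correct, total, round(accuracy, 4))
print(zero_recall)
```

The diagonal sums to 69 of 139 test samples, about 0.496, which matches `test_acc`. Two classes (rows 8 and 10, with only 2 and 3 test samples) are never predicted correctly, which a single accuracy figure does not show.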

### 3)	You will work with a very small dataset. Please report the challenges while you are working with such kind of small datasets and explain your approaches to overcome this problem.

Small datasets are hard to work with, because models can easily overfit the data. Data augmentation techniques may help, but since some categories have very few samples, augmentation carries a high risk of generating mislabeled samples.

### 4)	The shared data is whole dataset (train + test dataset). Please clearly state the size of your train and test dataset.

With a big dataset we can keep the test set portion small, such as 1-2%. With a small dataset like this one, we have to make the test set portion bigger.
<br> This dataset has 462 entries. We split the data into train and test sets at 70% and 30%, which gives 323 entries in the training set and 139 entries in the test set.
<br> Because the data is imbalanced, we use stratify=y to distribute the categories proportionally. The test set needs at least one sample from each category, or else we cannot evaluate the model for that category.
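
The split sizes follow from simple rounding. The sketch below reproduces the arithmetic as I understand scikit-learn applies it for a fractional `test_size` (the test size is rounded up and the remainder is used for training); `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # fractional test size is rounded up
    n_train = n_samples - n_test               # the rest goes to training
    return n_train, n_test

n_train, n_test = split_sizes(462, 0.3)
print(n_train, n_test)
```

For 462 samples and `test_size=0.3` this gives 323 training and 139 test entries, consistent with the 139 samples in the confusion matrix.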

### 5)	Which text classification model have you used and why?

I used Naive Bayes, Logistic Regression, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forest, and AdaBoost as machine learning algorithms and a Gated Recurrent Unit (GRU) as the deep learning algorithm, to see which classification algorithm would yield the best result.
<br> I chose KNN with k=3 as the best algorithm for this dataset.

### 6)	Explain hyperparameter tuning. How did hyperparameters affect your model?

Hyperparameter tuning means choosing the best hyperparameters for a classification model. I used grid search with 10-fold cross validation for hyperparameter tuning. The search tries every hyperparameter combination given by the user and keeps the one with the best average validation score. To reduce the effect of randomness, K-fold cross validation splits the data into K parts, uses one part for validation and the rest for training, and rotates the validation part at each step.
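
The procedure can be sketched in plain Python. The code below builds the K folds by hand and picks the hyperparameter value with the best mean validation score; `evaluate` is a hypothetical scorer with made-up scores, standing in for fitting and scoring a real model. scikit-learn's `GridSearchCV`, imported at the top of the notebook, automates the same loop.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def grid_search(param_values, n_samples, k, evaluate):
    """Return the parameter value with the highest mean validation score."""
    mean_scores = {}
    for value in param_values:
        scores = [evaluate(value, train_idx, val_idx)
                  for train_idx, val_idx in k_fold_indices(n_samples, k)]
        mean_scores[value] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Hypothetical scorer: made-up scores that peak at 3 neighbors, ignoring the fold.
def evaluate(n_neighbors, train_idx, val_idx):
    return 1.0 - abs(n_neighbors - 3) * 0.1

best, mean_scores = grid_search([1, 3, 5, 7], n_samples=462, k=10, evaluate=evaluate)
folds = list(k_fold_indices(462, 10))
print(best, [len(val_idx) for _, val_idx in folds])
```

Averaging over all folds means every sample contributes to validation once, so the chosen value depends less on one lucky or unlucky split.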

### 7)	Do not hesitate to apply the most up-to-date solutions in the field of NLP.

Since the dataset is small, the best way to classify is to use Large Language Models.
<br> We may take pretrained models and pretrained embeddings, then fine-tune the model on our dataset.
<br> Another way is to use zero-shot learning.
<br> We used zero-shot learning for the sub level categories.
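
Zero-shot classifiers read each candidate label as natural language, so camelCase labels such as `governmentPrivateActorAgreements` may be harder for the model to interpret than plain words. One possible tweak, not used in the notebook, is to humanize the labels before classification and map the winning label back to the original; `humanize` is a hypothetical helper.

```python
import re

def humanize(label):
    """Split camelCase into lowercase words: 'armedConflicts' -> 'armed conflicts'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label).lower()

labels = ["governmentPrivateActorAgreements", "explosionsRemoteViolence", "badGovernance"]

# Map the readable form back to the original label used in the dataset
readable_to_original = {humanize(label): label for label in labels}
print(list(readable_to_original))
```

The readable strings would be passed as `candidate_labels`, and `readable_to_original` would turn the predicted label back into the dataset's own category name. Whether this improves results on this dataset is untested.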