# Solution of the coding challenge


## Prerequisites to run the code
The root of the repository contains a file named `requirements.txt`, which lists all the packages required to run the code. To install them, run `pip install -r requirements.txt`.

There are two main modules in this project:
1. preProcessing.py
2. nlpClassifier.py

Let's briefly explain the functionality of each module.
## preProcessing Module
The preProcessing module is designed to handle the preprocessing tasks for textual data. It includes functionalities such as reading JSON files, extracting relevant information, performing text preprocessing (tokenization, stopword removal, etc.), and saving the processed data into a structured format. This module provides a clean and organized way to prepare textual data for further analysis and modeling.
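
The following is a minimal sketch of the kind of text cleaning described above, using only the Python standard library. The function name `simple_preprocess` and the tiny stopword set are hypothetical. The module's actual `preprocess_text` method is not shown in this document and, judging by its output further down, it also lemmatizes tokens, which this sketch does not.

```python
import re

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "of", "a", "an", "and", "is", "in", "to"}


def simple_preprocess(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    # [^\W\d_]+ matches runs of letters (including non-ASCII letters),
    # so punctuation, digits, and underscores are discarded.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


print(simple_preprocess("Desired start date: 16/10/2023 Duration of the mission"))
# desired start date duration mission
```

A real pipeline would normally use a full stopword list and a lemmatizer from an NLP library rather than a hand-written set.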

## nlpClassifier Module
The nlpClassifier module focuses on natural language processing (NLP) tasks, specifically text classification. It uses the spaCy library to generate document vectors and a logistic regression classifier to train and evaluate models. The module encapsulates functionality for loading and splitting data, extracting document vectors, training the classifier, and evaluating its performance with metrics such as accuracy and precision. Additionally, it includes a `perform_topic_modeling` method that uses spaCy to surface the named entities in the documents, giving deeper insight into the content of the textual data.

These modules are designed to work together seamlessly, with preProcessing providing clean and processed data, and nlpClassifier leveraging this data to build and evaluate text classification models with additional insights gained through topic modeling.


In [25]:
# import preProcessing module
from preProcessing import PreProcessing as pp
import pandas as pd


# The `PreProcessing` class takes the path of a JSON file and the name of its class as input.
# It can then produce a pandas dataframe holding the processed text along with
# the original text, the class name, and the class label.

# Preprocess the JSON files and get a dataframe for each one
# by initializing the class once per file

df_file1 = pp('./data_query_from_9367.json', '9367')
df_file2 = pp('./data_query_from_9578.json', '9578')


In the cell above we preprocessed the JSON files and created two dataframes. Now we can merge them into a single dataframe.

In [30]:
# join the two dataframes
joined_df = df_file1.join_dataframes(df_file1.df, df_file2.df)
joined_df.shape
#joined_df.tail()

(523, 3)

We were able to parse the JSON files and create a dataframe in a structured format. Note that the class of each document is stored in a new column, `class`. Once the text has been preprocessed, the dataframe can be saved to a CSV file so that the system does not have to repeat the preprocessing every time it uses the data.


Let's clean and tokenize the text so that embedding vectors can be generated, since a machine learning model works with numbers rather than raw text.

In [31]:
joined_df['Processed_Text'] = joined_df['Full_Text'].apply(df_file1.preprocess_text)

In [36]:
print(joined_df.shape)
joined_df.head()

(523, 4)


|   | Article_ID | Full_Text | class | Processed_Text |
|---|---|---|---|---|
| 0 | 423340639 | Desired start date: ASAP Duration of the missi... | 9367 | desire start date asap duration mission year l... |
| 1 | 423341081 | Desired start date: 16/10/2023 Duration of the... | 9367 | desire start date duration mission month poten... |
| 2 | 423382015 | Florida Health has recorded locally acquired d... | 9367 | florida health record locally acquire dengue c... |
| 3 | 423324199 | About CRS. Catholic Relief Services is the off... | 9367 | crs catholic relief services official internat... |
| 4 | 423385532 | Attachments Au Niger, dans sa mission de fourn... | 9367 | attachments au niger dans sa mission de fourni... |


As the `Processed_Text` column shows, the text is now tokenized and cleaned. We save the dataframe to a CSV file so that text preprocessing does not have to be repeated every time we want to use the data.


In [40]:
joined_df.to_csv('processed_data.csv', index=False)
print(joined_df.head())
print(joined_df.tail())
print(joined_df.shape)

  Article_ID                                          Full_Text class  \
0  423340639  Desired start date: ASAP Duration of the missi...  9367   
1  423341081  Desired start date: 16/10/2023 Duration of the...  9367   
2  423382015  Florida Health has recorded locally acquired d...  9367   
3  423324199  About CRS. Catholic Relief Services is the off...  9367   
4  423385532  Attachments Au Niger, dans sa mission de fourn...  9367   

                                      Processed_Text  
0  desire start date asap duration mission year l...  
1  desire start date duration mission month poten...  
2  florida health record locally acquire dengue c...  
3  crs catholic relief services official internat...  
4  attachments au niger dans sa mission de fourni...  
    Article_ID                                          Full_Text class  \
363  423310860  Η ανοσοθεραπεία στην ογκολογία-αιματολογία στο...  9578   
364  423308521  Herring with garlic and terrible nausea: Musul...  9578   
365  4

Everything looks good. The `head` and `tail` outputs confirm that the data looks as expected and was processed correctly. Now we can move on to the next step.


## 1.2 Training and evaluation of a text classification model

In [64]:
from nlpClassifier import NLPClassifier
# initialize the class
classifier = NLPClassifier()

In [65]:
# load the data
classifier.load_data('processed_data.csv')
print(classifier.df.head())


   Article_ID                                          Full_Text  class  \
0   423340639  Desired start date: ASAP Duration of the missi...   9367   
1   423341081  Desired start date: 16/10/2023 Duration of the...   9367   
2   423382015  Florida Health has recorded locally acquired d...   9367   
3   423324199  About CRS. Catholic Relief Services is the off...   9367   
4   423385532  Attachments Au Niger, dans sa mission de fourn...   9367   

                                      Processed_Text  
0  desire start date asap duration mission year l...  
1  desire start date duration mission month poten...  
2  florida health record locally acquire dengue c...  
3  crs catholic relief services official internat...  
4  attachments au niger dans sa mission de fourni...  


In [73]:
# split the data into train and test
classifier.split_data()
print('The shape of the train and test data is:')
print(classifier.X_train.shape, classifier.X_test.shape, classifier.y_train.shape, classifier.y_test.shape)
# print the train, test, and label data
print('##############################################')
print(classifier.X_train.head())
print(classifier.X_test.head())
print(classifier.y_train.head())
print(classifier.y_test.head())


The shape of the train and test data is:
(418,) (105,) (418,) (105,)
##############################################
204    o principal suspeito e esfaquear um homem meio...
385    cd de méxico octubre en el marco del día de la...
249    au zimbabwé depuis septembre les autorités loc...
92     new york october good afternoon good exciting ...
426    lázaro cárdenas de octubre de el gobierno de m...
Name: Processed_Text, dtype: object
187    foto reprodução freepik nas plataformas de míd...
460    w spotkaniu uczestniczyli przedstawiciele sena...
491    queensland mum year old daughter hospital meni...
306    distintas dinámicas realizan los equipos de pr...
440    paciente sente dore que estão ligada problemas...
Name: Processed_Text, dtype: object
204    9578
385    9578
249    9578
92     9367
426    9578
Name: class, dtype: int64
187    9578
460    9578
491    9578
306    9578
440    9578
Name: class, dtype: int64
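
The shapes above (418 training rows and 105 test rows out of 523) are consistent with an 80/20 split in which the test share is rounded up. The internals of `split_data` are not shown in this document, so the following is only a sketch of such a split. The function name `split_indices`, the 0.2 test fraction, and the seed are all assumptions.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded so the split is reproducible
    n_test = math.ceil(n_samples * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(523)
print(len(train_idx), len(test_idx))  # 418 105
```

Shuffling before splitting matters here because the joined dataframe lists all rows of one class before the other.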


In [75]:
# generate document vectors
X_train_vectors, X_test_vectors = classifier.prepare_data()
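
By default, spaCy's `Doc.vector` is an average of the token vectors. How `prepare_data` uses it is not shown here, so the sketch below only illustrates the averaging idea, with a hypothetical three-dimensional embedding table (`TOY_EMBEDDINGS`) standing in for a real spaCy model.

```python
# Hypothetical toy embeddings; a real spaCy model supplies much larger vectors.
TOY_EMBEDDINGS = {
    "dengue": [1.0, 0.0, 2.0],
    "health": [3.0, 2.0, 0.0],
    "record": [2.0, 4.0, 1.0],
}


def document_vector(text, dim=3):
    """Average the vectors of the known tokens; return zeros if none are known."""
    # Simplification in this sketch: tokens missing from the table are skipped.
    vectors = [TOY_EMBEDDINGS[tok] for tok in text.split() if tok in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


print(document_vector("florida health record dengue"))  # [2.0, 2.0, 1.0]
```

Each document becomes a fixed-length vector regardless of its length, which is what allows a logistic regression classifier to be trained on it.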

In [76]:
# train the classifier
classifier.train_model(X_train_vectors, classifier.y_train)

In [77]:
# evaluate the classifier
classifier.evaluate_model(X_test_vectors, classifier.y_test)

In [78]:
# print the classification report
classifier.print_results()

Accuracy: 0.97
Precision: 0.97


We got very good results: given a text, our logistic regression model predicts which file it belongs to with 97% accuracy on the test set.
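
For reference, accuracy is the share of predictions that are correct, and precision for a class is the share of predictions of that class that are correct. How `print_results` aggregates precision across the two classes is not shown in this document. The sketch below computes per-class precision on made-up labels, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive):
    """Of all samples predicted as `positive`, the fraction that truly are."""
    true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not true_of_predicted:
        return 0.0  # no sample was predicted as `positive`
    return sum(t == positive for t in true_of_predicted) / len(true_of_predicted)


# Made-up labels for illustration; these are not the challenge's predictions.
y_true = [9367, 9367, 9578, 9578, 9578]
y_pred = [9367, 9578, 9578, 9578, 9578]

print(accuracy(y_true, y_pred))         # 0.8
print(precision(y_true, y_pred, 9578))  # 0.75
print(precision(y_true, y_pred, 9367))  # 1.0
```

Because the two classes are not equally frequent in the data, it is worth reading per-class precision alongside overall accuracy.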


In [80]:
# Let's do some topic modeling or named entity recognition for our documents
classifier.perform_topic_modeling()


Topics:
Topic 1: ['year', 'kabul solidarite international si international humanitarian aid association', 'afghanistan', 'soviet', 'afghan', 'afghanistan', 'wardak bamiyan', 'khost paktika samangan kunduz province organization', 'afghanistan', 'december', 'kapisa wardak', 'kabul', 'nimroz farah province', 'nimroz farah', 'province si', 'annual budget million euro', 'south west', 'dcd si', 'pm', 'dcdp', 'month', 'kabul', 'rrm department', 'mid summer', 'afghanistan', 'eu bha usaid', 'un', 'unhcr', 'english', 'dari pashto asset', 'month', 'annual', 'monthly', 'monthly', 'august', 'working day month', 'afghanistan', 'august', 'iskp', 'kabul', 'kabul', 'sl solidarités international si est', 'prévenir et à combattre', 'des membres des communautés bénéficiaire ou de ses collaborateur et', 'collaboratrice atteinte aux personnes et ou aux bien', 'non déclaré', 'atteinte aux droits de qui pourrait être perpétré dans le cadre de ses intervention si applique', 'des actes de seah solidarités inter

As the output shows, our topic modeling function returns the named entities found by Named Entity Recognition (NER) in our documents. This can help the system understand the content of each document.
We can further use this information to classify the documents into different categories.
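
The entity list printed above contains many repeats, so a frequency count makes it easier to read. The sketch below assumes the entities are available as a flat list of strings. The list is a short hand-picked excerpt of the Topic 1 output, and `top_entities` is a hypothetical helper, not part of `nlpClassifier`.

```python
from collections import Counter

# A short hand-picked excerpt of the entities printed for Topic 1 above.
entities = [
    "year", "afghanistan", "soviet", "afghan", "afghanistan",
    "wardak bamiyan", "afghanistan", "december", "kapisa wardak", "kabul",
    "nimroz farah province", "nimroz farah", "month", "kabul",
]


def top_entities(items, n=2):
    """Return the n most frequent entity strings with their counts."""
    return Counter(items).most_common(n)


print(top_entities(entities))  # [('afghanistan', 3), ('kabul', 2)]
```

Counts like these could serve as simple features when grouping documents into categories.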

**End of the coding challenge**