<center><a href="#"><img src="https://3d-media.pro/images/Logo-DATA-For-Developpement.png" width="250" align="center"></a></center>

<h1 style="text-align:center;"> Project: Develop an end-to-end Machine Learning Pipeline </h1> 

<h3 style="text-align:center;">Instructor: Assan Sanogo</h3>

<h3>Project Overview:</h3>
<p>This project is based on a dataset of 7000+ essays graded by English specialists. The problem is close to a real-world situation: the data must be cleaned and a thorough EDA carried out before the team can engineer relevant features.</p>
<p>This is an NLP problem that will form the foundation of an English program used by the company Easy Sailing Language Training. Their ambition is to have a reliable tool to assess new students’ ability to write in English according to the IELTS grading system. In turn, it would help prospective students estimate how much time they need to invest to reach the next level.</p>

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
    * [Data processing](#dataprocessing)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

### Introduction: Business Problem <a name="introduction"></a>
<p><strong>DETEMLP</strong> is a project that aims to develop an end-to-end pipeline to process essays and output a grade describing the level of English proficiency. This project is based on a dataset of 7000+ essays graded by English specialists. </p>
<p>The goal is to <strong>create a reliable tool to assess new students’ ability to write in English according to the IELTS grading system</strong>. In this project, we’ll be using data processing, data cleaning, and NLP techniques, including the spaCy library. If we struggle with the dataset along the way, we might reframe the problem as a classification problem.</p>
<p>Let’s dive in.</p>

### Data<a name="data"></a>
<p>
Data collection was straightforward. Here is the list of files in our dataset:</p>
<ul>
<li>test_set.tsv</li>
<li>training_set_rel3.tsv</li>
<li>training_set_rel3.xls</li>
<li>training_set_rel3.xlsx</li>
<li>valid_sample_submission_1_column.csv</li>
<li>valid_sample_submission_1_column_no_header.csv</li>
<li>valid_sample_submission_2_column.csv</li>
<li>valid_sample_submission_5_column.csv</li>
<li>valid_set.tsv</li>
<li>valid_set.xls</li>
<li>valid_set.xlsx </li>
</ul>

<p></p>

<i>Data processing<a name="dataprocessing"></a></i>

In [3]:
# import libraries
import pandas as pd
import numpy as np

In [4]:
# take a look at the validation set
df_valid_set = pd.read_csv("valid_set.tsv",sep="\t",encoding="latin1")
df_valid_set.head()

Unnamed: 0,essay_id,essay_set,essay,domain1_predictionid,domain2_predictionid
0,1788,1,"Dear @ORGANIZATION1, @CAPS1 more and more peop...",1788,
1,1789,1,Dear @LOCATION1 Time @CAPS1 me tell you what I...,1789,
2,1790,1,"Dear Local newspaper, Have you been spending a...",1790,
3,1791,1,"Dear Readers, @CAPS1 you imagine how life woul...",1791,
4,1792,1,"Dear newspaper, I strongly believe that comput...",1792,
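
The essay previews above contain placeholders such as `@CAPS1`, `@ORGANIZATION1`, and `@LOCATION1`, which appear to be anonymization tokens. As a minimal sketch of one possible cleaning step, the snippet below finds and strips such tokens with a regular expression. The pattern and the helper names `find_anonymization_tokens` and `strip_anonymization_tokens` are assumptions for illustration, not part of the original notebook, and the pattern may need adjusting after inspecting the full dataset.

```python
import re

# Assumed token shape: "@", one or more capital letters, then one or more digits
# (e.g. @CAPS1, @ORGANIZATION1). This is inferred from the previews only.
ANON_TOKEN = re.compile(r"@[A-Z]+\d+")


def find_anonymization_tokens(text):
    """Return the placeholder tokens found in an essay, in order of appearance."""
    return ANON_TOKEN.findall(text)


def strip_anonymization_tokens(text):
    """Remove placeholder tokens and collapse the whitespace left behind."""
    without_tokens = ANON_TOKEN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tokens).strip()


sample = "Dear @ORGANIZATION1, @CAPS1 more and more people use computers"
print(find_anonymization_tokens(sample))
print(strip_anonymization_tokens(sample))
```

Whether to drop these tokens or map them to a generic word (for example a single `ENTITY` marker) is a design choice: dropping them changes sentence length, which could matter if essay length is later used as a feature.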


In [9]:
df_valid_set.describe()

Unnamed: 0,essay_id,essay_set,domain1_predictionid,domain2_predictionid
count,4218.0,4218.0,4218.0,600.0
mean,11282.44642,4.123518,13735.433618,7178.0
std,6173.633131,2.117188,6964.020021,346.698716
min,1788.0,1.0,1788.0,6579.0
25%,5243.25,2.0,7508.5,6878.5
50%,10995.5,4.0,13995.5,7178.0
75%,16852.75,6.0,19852.75,7477.5
max,21938.0,8.0,24938.0,7777.0


In [10]:
df_valid_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4218 entries, 0 to 4217
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   essay_id              4218 non-null   int64  
 1   essay_set             4218 non-null   int64  
 2   essay                 4218 non-null   object 
 3   domain1_predictionid  4218 non-null   int64  
 4   domain2_predictionid  600 non-null    float64
dtypes: float64(1), int64(3), object(1)
memory usage: 164.9+ KB
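
The `info()` output above shows that `domain2_predictionid` is filled for only 600 of the 4218 rows. A quick way to quantify this for every column might look like the following sketch. The helper `missing_value_summary` and the toy frame are hypothetical; in the notebook the function would be applied to `df_valid_set`.

```python
import pandas as pd


def missing_value_summary(df):
    """Return per-column counts and shares of missing values, most incomplete first."""
    summary = pd.DataFrame({
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
        "missing_share": df.isna().mean(),
    })
    return summary.sort_values("missing_share", ascending=False)


# Toy frame standing in for df_valid_set (values are illustrative only)
toy = pd.DataFrame({
    "essay_id": [1788, 1789, 1790, 1791],
    "domain1_predictionid": [1788, 1789, 1790, 1791],
    "domain2_predictionid": [None, None, None, 6579.0],
})
print(missing_value_summary(toy))
```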


In [7]:
# valid sample submission
df_sample_submission = pd.read_csv("valid_sample_submission_5_column.csv",encoding="latin1")
df_sample_submission.head()

Unnamed: 0,prediction_id,essay_id,essay_set,essay_weight,predicted_score
0,1788,1788,1,1.0,7
1,1789,1789,1,1.0,8
2,1790,1790,1,1.0,9
3,1791,1791,1,1.0,9
4,1792,1792,1,1.0,9


In [8]:
# training set
df_training_set = pd.read_csv("training_set_rel3.tsv",sep="\t",encoding="latin1")
df_training_set.head() 

Unnamed: 0,essay_id,essay_set,essay,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,...,rater2_trait3,rater2_trait4,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6
0,1,1,"Dear local newspaper, I think effects computer...",4,4,,8,,,,...,,,,,,,,,,
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",5,4,,9,,,,...,,,,,,,,,,
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",4,3,,7,,,,...,,,,,,,,,,
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",5,5,,10,,,,...,,,,,,,,,,
4,5,1,"Dear @LOCATION1, I know having computers has a...",4,4,,8,,,,...,,,,,,,,,,


In [11]:
df_training_set.describe()

Unnamed: 0,essay_id,essay_set,rater1_domain1,rater2_domain1,rater3_domain1,domain1_score,rater1_domain2,rater2_domain2,domain2_score,rater1_trait1,...,rater2_trait3,rater2_trait4,rater2_trait5,rater2_trait6,rater3_trait1,rater3_trait2,rater3_trait3,rater3_trait4,rater3_trait5,rater3_trait6
count,12976.0,12976.0,12976.0,12976.0,128.0,12976.0,1800.0,1800.0,1800.0,2292.0,...,2292.0,2292.0,723.0,723.0,128.0,128.0,128.0,128.0,128.0,128.0
mean,10295.395808,4.179485,4.127158,4.137408,37.828125,6.800247,3.333889,3.330556,3.333889,2.444154,...,2.635689,2.710297,3.777317,3.589212,3.945312,3.890625,4.078125,3.992188,3.84375,3.617188
std,6309.074105,2.136913,4.212544,4.26433,5.240829,8.970705,0.729103,0.726807,0.729103,1.21173,...,1.142566,1.045795,0.689401,0.693256,0.643668,0.63039,0.622535,0.509687,0.538845,0.603417
min,1.0,1.0,0.0,0.0,20.0,0.0,1.0,1.0,1.0,0.0,...,0.0,0.0,1.0,1.0,2.0,2.0,2.0,3.0,2.0,2.0
25%,4438.75,2.0,2.0,2.0,36.0,2.0,3.0,3.0,3.0,2.0,...,2.0,2.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,3.0
50%,10044.5,4.0,3.0,3.0,40.0,3.0,3.0,3.0,3.0,2.0,...,2.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
75%,15681.25,6.0,4.0,4.0,40.0,8.0,4.0,4.0,4.0,3.0,...,4.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
max,21633.0,8.0,30.0,30.0,50.0,60.0,4.0,4.0,4.0,6.0,...,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,5.0,5.0
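
The `describe()` output above shows `domain1_score` ranging from 0 to 60 with a median of 3, which suggests that the essay sets (numbered 1 to 8) may be graded on different scales. One possible way to make the scores comparable, sketched below, is min-max scaling within each essay set. The helper `normalize_scores_by_set`, the `_scaled` column suffix, and the toy values are assumptions for illustration; the scaling uses the observed minimum and maximum per set rather than any official rubric range.

```python
import pandas as pd


def normalize_scores_by_set(df, set_col="essay_set", score_col="domain1_score"):
    """Add a column with the score min-max scaled to [0, 1] within each essay set."""
    grouped = df.groupby(set_col)[score_col]
    low = grouped.transform("min")
    high = grouped.transform("max")
    # Leave NaN where a set has only one distinct score, to avoid dividing by zero
    span = (high - low).where(high > low)
    out = df.copy()
    out[score_col + "_scaled"] = (df[score_col] - low) / span
    return out


# Toy data standing in for df_training_set (values are illustrative only)
toy = pd.DataFrame({
    "essay_set": [1, 1, 1, 8, 8, 8],
    "domain1_score": [2, 7, 12, 10, 35, 60],
})
scaled = normalize_scores_by_set(toy)
print(scaled)
```

Running `df_training_set.groupby("essay_set")["domain1_score"].agg(["min", "max"])` first would confirm whether the sets really do use different ranges before any scaling is applied.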


In [12]:
df_training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12976 entries, 0 to 12975
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   essay_id        12976 non-null  int64  
 1   essay_set       12976 non-null  int64  
 2   essay           12976 non-null  object 
 3   rater1_domain1  12976 non-null  int64  
 4   rater2_domain1  12976 non-null  int64  
 5   rater3_domain1  128 non-null    float64
 6   domain1_score   12976 non-null  int64  
 7   rater1_domain2  1800 non-null   float64
 8   rater2_domain2  1800 non-null   float64
 9   domain2_score   1800 non-null   float64
 10  rater1_trait1   2292 non-null   float64
 11  rater1_trait2   2292 non-null   float64
 12  rater1_trait3   2292 non-null   float64
 13  rater1_trait4   2292 non-null   float64
 14  rater1_trait5   723 non-null    float64
 15  rater1_trait6   723 non-null    float64
 16  rater2_trait1   2292 non-null   float64
 17  rater2_trait2   2292 non-null  