## CommonLit Readability Prize

- *Motivation*
In this competition, you’ll build algorithms to rate the complexity of reading passages for grade 3-12 classroom use. To accomplish this, you'll pair your machine learning skills with a dataset that includes readers from a wide variety of age groups and a large collection of texts taken from various domains. Winning models will be sure to incorporate text cohesion and semantics.

- *Problem*
The competition provides text excerpts, and the task is to predict a difficulty (readability) score for each excerpt. This makes it a supervised regression problem.

- *My Approach*
I always start with a simple algorithm first, so I use a TF-IDF vectorizer to extract features from the text and then fit a simple Ridge regression on them. With this approach I was able to get a score of 0.7 on the private leaderboard.

I have not optimized the model yet.

The notebook below documents my approach.


In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

%matplotlib inline 

import nltk
import re
import warnings 
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model,metrics

import xgboost as xgb
import joblib
import pickle

### Import the data

In [None]:
df_train=pd.read_csv('../input/commonlitreadabilityprize/train.csv')

In [None]:
df_train.head()

In [None]:
df_train.shape

In [None]:
df_train['target'].hist();

## The target looks approximately normally distributed and already appears to be roughly scaled.
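A histogram only gives a visual impression. As a minimal sketch of a numeric check, the snippet below computes skewness and excess kurtosis, both of which are close to zero for a normal distribution. `toy_target` is a randomly generated stand-in for `df_train['target']`, not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train['target']; NOT the real competition data.
rng = np.random.default_rng(0)
toy_target = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5000), name="target")

summary = {
    "mean": toy_target.mean(),
    "std": toy_target.std(),
    "skew": toy_target.skew(),      # close to 0 for a symmetric distribution
    "kurtosis": toy_target.kurt(),  # excess kurtosis, close to 0 for a normal distribution
}
print(summary)
```

On the real data, the same four numbers can be read from `df_train['target']` directly.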

In [None]:
### Keep only the columns we need (copy so later column assignments do not touch a view)
df_train=df_train[['excerpt','target']].copy()

In [None]:
### Let's fit a simple linear regression model to this dataset and see the score. First, some basic text cleaning.
import spacy 
nlp=spacy.load('en_core_web_sm')
stopwords=nlp.Defaults.stop_words

from nltk.tokenize import word_tokenize


def preprocess(df):
    df.loc[:,'cleaned']=df['excerpt'].apply(lambda x: str(x))      # convert to string
    df.loc[:,'cleaned']=df['cleaned'].apply(lambda x: x.lower())   # lowercase the words
    df.loc[:,'cleaned']=df['cleaned'].apply(lambda x: re.sub(r'[^\w\s]','',x)) # remove punctuation (whitespace is kept)
    df.loc[:,'cleaned']=df['cleaned'].apply(lambda x: word_tokenize(x)) # split into tokens
    df.loc[:,'cleaned']=df['cleaned'].apply(lambda x: [word for word in x  if word not in stopwords]) # remove stopwords
    df.loc[:,'cleaned']=df['cleaned'].apply(lambda x: " ".join(x)) # join tokens back into one string

In [None]:
preprocess(df=df_train)

In [None]:
df_train.head()
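To make the cleaning steps concrete, here is a minimal, dependency-free sketch of the same sequence applied to a single string. `TOY_STOPWORDS` and `clean_text` are hypothetical names used only for illustration. Whitespace splitting stands in for `word_tokenize` and the tiny stopword set stands in for spaCy's list, so results on real excerpts may differ.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses spaCy's much larger list.
TOY_STOPWORDS = {"the", "a", "is", "on", "and"}

def clean_text(text):
    text = str(text).lower()                # normalise type and case
    text = re.sub(r"[^\w\s]", "", text)     # drop punctuation, keep word characters and whitespace
    tokens = text.split()                   # whitespace split (stand-in for word_tokenize)
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return " ".join(tokens)

example = clean_text("The cat, surprisingly, is on the mat!\nAnd it sleeps.")
print(example)
```

Note that the regular expression keeps whitespace, so newlines and tabs only disappear at the tokenisation step.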

In [None]:
### Create folds 

from sklearn import model_selection
df_train['fold']=-1

kf=model_selection.KFold(n_splits=3,shuffle=True, random_state=42)

for fold,(train_index,valid_index) in enumerate(kf.split(df_train['cleaned'])):
    df_train.loc[valid_index,'fold']=fold
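One detail worth checking in the fold assignment: `KFold.split` yields positional indices, while `.loc` selects by label. The two agree only when the frame has a default `RangeIndex`. A small sketch on a toy frame (`toy` is a made-up stand-in for `df_train`) confirms that every row ends up in exactly one fold:

```python
import pandas as pd
from sklearn import model_selection

# Toy frame standing in for df_train (10 rows, default RangeIndex).
toy = pd.DataFrame({"cleaned": [f"text {i}" for i in range(10)], "fold": -1})

kf = model_selection.KFold(n_splits=3, shuffle=True, random_state=42)
for fold, (_, valid_index) in enumerate(kf.split(toy)):
    # valid_index holds positions; with a default RangeIndex they match the labels used by .loc
    toy.loc[valid_index, "fold"] = fold

fold_sizes = toy["fold"].value_counts().sort_index()
print(fold_sizes)
```

If rows were filtered or reordered beforehand, calling `reset_index(drop=True)` before creating folds keeps positions and labels aligned.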

In [None]:
### Let's see if a TF-IDF vectorizer with Ridge regression works well
## Building the custom cross-validation loop

train_RMSE=[]
val_RMSE=[]
i=1
for fold in np.arange(df_train['fold'].nunique()):
    
    X_train,y_train= df_train[df_train['fold']!=fold]['cleaned'],df_train[df_train['fold']!=fold]['target']
    X_val,y_val    = df_train[df_train['fold']==fold]['cleaned'],df_train[df_train['fold']==fold]['target']
    
    tfidf=TfidfVectorizer(lowercase=False,tokenizer=word_tokenize,ngram_range=(1,4),max_features=1000)
    
    tfidf.fit(X_train)
    train_transform=tfidf.transform(X_train)
    val_transform=tfidf.transform(X_val)
    
    lr=linear_model.Ridge()
    lr.fit(train_transform,y_train)
    
    train_preds=lr.predict(train_transform)
    
    val_preds=lr.predict(val_transform)
    
    # mean_squared_error returns MSE, so take the square root to report RMSE
    train_score =np.sqrt(metrics.mean_squared_error(y_train,train_preds))
    val_score   =np.sqrt(metrics.mean_squared_error(y_val,val_preds))
    
    train_RMSE.append(train_score)
    val_RMSE.append(val_score)
    
    
    print(f"*** FINISHED FOLD {i} Train_RMSE={train_score} and Valid_RMSE={val_score} ***")
    i=i+1
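The manual loop above can also be written more compactly with a scikit-learn `Pipeline`, which refits the vectorizer inside each fold and so avoids leaking validation vocabulary into training. This is a sketch under assumptions, not the notebook's actual run: `texts` and `scores` are made-up stand-ins for `df_train['cleaned']` and `df_train['target']`, and the vectorizer settings are simplified.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus and made-up scores; NOT the competition data.
texts = [
    "cat sat mat", "dog ran park", "sun rose early", "bird sang song",
    "quantum entanglement violates locality",
    "mitochondria synthesize adenosine triphosphate",
    "jurisprudence informs constitutional interpretation",
    "thermodynamic equilibrium minimises free energy",
    "fish swam pond",
]
scores = np.array([1.0, 0.9, 0.8, 1.1, -2.0, -2.2, -1.9, -2.1, 1.0])

pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# scikit-learn maximises scores, so RMSE is exposed as a negative value.
neg_rmse = cross_val_score(pipe, texts, scores, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse
print(rmse_per_fold, rmse_per_fold.mean())
```

Reporting the mean (and spread) of the per-fold RMSE gives a single number to compare against the leaderboard metric.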

## Dump model for prediction

In [None]:
## Train a model for prediction 


tfidf=TfidfVectorizer(lowercase=False,tokenizer=word_tokenize,ngram_range=(1,4),max_features=1000)

transformed_text=tfidf.fit_transform(df_train['cleaned'])

model=linear_model.Ridge()
model.fit(transformed_text,df_train['target'])

# save the model to disk
filename = 'Ridge_Regression_1000features.sav'
joblib.dump(model, filename)

# save the fitted vectorizer too; the context manager closes the file handle
with open("tfidf.pickle", "wb") as f:
    pickle.dump(tfidf, f)
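For inference, both saved artifacts have to be loaded back, and new text must pass through the fitted vectorizer with `transform` rather than `fit_transform`. The following is a minimal round-trip sketch on toy data. The training texts, targets, and file names are made up for illustration, and a default `TfidfVectorizer` is used instead of the notebook's settings.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Toy training data with made-up targets; NOT the competition data.
train_texts = ["cat sat mat", "dog ran park", "quantum field theory", "tensor calculus proof"]
train_targets = [1.0, 0.8, -2.0, -1.8]

vectorizer = TfidfVectorizer()
ridge = Ridge().fit(vectorizer.fit_transform(train_texts), train_targets)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "ridge.sav")      # hypothetical file names
    vec_path = os.path.join(tmp, "tfidf.pickle")
    joblib.dump(ridge, model_path)
    with open(vec_path, "wb") as f:
        pickle.dump(vectorizer, f)

    # Later, e.g. in an inference notebook: load both artifacts back.
    loaded_model = joblib.load(model_path)
    with open(vec_path, "rb") as f:
        loaded_vec = pickle.load(f)

# New text goes through the already fitted vectorizer (transform only).
new_preds = loaded_model.predict(loaded_vec.transform(["cat ran park"]))
print(new_preds)
```

Because pickle stores functions by reference, a vectorizer created with `tokenizer=word_tokenize` needs `nltk` to be importable in the environment where it is loaded.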