### The goal of this notebook is to showcase a cheap, fast way to handle text classification tasks without any fancy hardware.

#### PS: I am using my MacBook Air.

#### Step 1: 
    
    Transform the text with a TF-IDF feature extractor, using a character n-gram range between 1 and 2.
    
#### Step 2: 
    
    Reduce the TF-IDF vectors with Truncated SVD, keeping as much of the variance as possible. Truncated SVD is the go-to method for sparse matrices instead of PCA, because it does not center the data and so works on the sparse matrix directly.
    
#### Step 3: 
    
    Run a logistic regression model on the reduced vectors.
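
The three steps can also be chained into one scikit-learn pipeline. This is only a minimal sketch on a tiny hand-made dataset: the sentences, the `toy_` names and `n_components=3` are assumptions for illustration, not the data or settings used in the rest of this notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration only (not the notebook's CSV files).
toy_texts = [
    "the weather is nice today",
    "where is the train station",
    "I would like a cup of tea",
    "this book is very interesting",
    "el tiempo es agradable hoy",
    "dónde está la estación de tren",
    "quisiera una taza de té",
    "este libro es muy interesante",
]
toy_langs = ["en"] * 4 + ["es"] * 4

toy_pipeline = make_pipeline(
    # Step 1: character uni-grams and bi-grams weighted by TF-IDF
    TfidfVectorizer(ngram_range=(1, 2), analyzer="char"),
    # Step 2: project the sparse vectors onto a few dense components
    TruncatedSVD(n_components=3, algorithm="arpack", random_state=0),
    # Step 3: classify the reduced vectors
    LogisticRegression(),
)
toy_pipeline.fit(toy_texts, toy_langs)

toy_preds = toy_pipeline.predict(["the station is near", "la estación está cerca"])
print(toy_preds)
```

With so few sentences the predictions themselves mean little; the point is only how the three steps fit together.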
    


In [12]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

### Load the data files

In [14]:
train = pd.read_csv("train_lang.csv")
valid = pd.read_csv("valid_lang.csv")
test = pd.read_csv("test_lang.csv")

### Build TF-IDF feature vectors. 

Building features on characters instead of words makes more sense for language detection, since some languages use characters that others don't. We can also use bi-grams while still keeping the dimensionality low, that is, well below 100,000 features. As a result, my MacBook Air does not run into memory errors when running the SVD algorithm.

In [3]:
tfidf = TfidfVectorizer(ngram_range=(1,2), analyzer='char')
train_tfidf = tfidf.fit_transform(train["text"].values)
valid_tfidf = tfidf.transform(valid["text"].values)
test_tfidf = tfidf.transform(test["text"].values)



In [4]:
train_tfidf.shape # check the dimensionality of the data

(84000, 6764)
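
To get a feel for what these character features look like, here is a small sketch on three made-up strings. The strings and the `ngram_` names are assumptions for illustration; only the vectorizer settings match the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy strings, invented for illustration only.
ngram_texts = ["the cat", "el gato", "die Katze"]

ngram_tfidf = TfidfVectorizer(ngram_range=(1, 2), analyzer="char")
ngram_matrix = ngram_tfidf.fit_transform(ngram_texts)

# vocabulary_ maps each character n-gram to its column index in the matrix.
ngram_vocab = sorted(ngram_tfidf.vocabulary_)

print(ngram_matrix.shape)  # (number of texts, number of distinct 1- and 2-grams)
print(ngram_vocab)
```

Every feature is a single character or a pair of adjacent characters (spaces included), which is why the vocabulary stays small compared to a word-level one.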

### Reduce the dimensionality to 400 features. 
The n_components parameter was tuned to give good performance with the logistic regression. Using the arpack solver kept my laptop from running into memory errors.



In [5]:
# Fit the SVD on the training vectors only, then reuse it for validation and test
svd = TruncatedSVD(n_components=400, algorithm="arpack")
train_svd = svd.fit_transform(train_tfidf)
valid_svd = svd.transform(valid_tfidf)
test_svd = svd.transform(test_tfidf)

Calculate how much variance the truncated features capture. A ratio of around 0.9 is usually recommended.
                                

In [6]:
svd.explained_variance_ratio_.sum() 

0.781500797964849
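
One way to see how many components a given ratio would need is to look at the cumulative sum of the explained variance ratios. The sketch below uses a small random sparse matrix as a stand-in for the TF-IDF vectors; the matrix, the `toy_` names and the 0.9 target are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical random sparse matrix standing in for the TF-IDF vectors.
toy_X = sparse_random(200, 50, density=0.2, format="csr", random_state=0)

# arpack needs n_components to be strictly smaller than the number of features.
toy_svd = TruncatedSVD(n_components=40, algorithm="arpack", random_state=0)
toy_svd.fit(toy_X)

# Running total of the variance captured by the first k components.
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

target = 0.9  # rule-of-thumb ratio
reached = cumulative >= target
# Smallest k that reaches the target, or None if all fitted components fall short.
k_needed = int(np.argmax(reached)) + 1 if reached.any() else None

print(cumulative[-1], k_needed)
```

If `k_needed` comes back as `None`, the fitted components do not reach the target and `n_components` would have to be raised, at the cost of more memory and time.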

### Build Logistic regression model and get the validation results. 

Since this is a multiclass classification problem, per-class precision, recall and F1 score are the right evaluation metrics.
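
As a reminder of what these numbers mean for a single class, here is a hand-rolled sketch in plain Python. The labels and the `per_class_scores` helper are hypothetical and only for illustration; the notebook itself relies on `classification_report`.

```python
# Hypothetical labels, invented for illustration only.
y_true = ["en", "en", "en", "fr", "fr", "de"]
y_pred = ["en", "en", "fr", "fr", "fr", "en"]


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, treated as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed
    # Undefined ratios are set to 0.0 in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


for label in sorted(set(y_true)):
    print(label, per_class_scores(y_true, y_pred, label))
```

Precision answers "of everything predicted as this language, how much was right", recall answers "of everything that really is this language, how much was found", and F1 is their harmonic mean.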



In [7]:
lr = LogisticRegression()
lr.fit(train_svd, train["lang"])
valid_preds = lr.predict(valid_svd)
print(classification_report(valid["lang"], valid_preds))



             precision    recall  f1-score   support

         bg       1.00      1.00      1.00      1000
         cs       0.97      0.96      0.97      1000
         da       0.98      0.98      0.98      1000
         de       0.98      0.98      0.98      1000
         el       1.00      0.98      0.99      1000
         en       0.95      0.99      0.97      1000
         es       0.98      0.98      0.98      1000
         et       0.98      0.98      0.98      1000
         fi       0.99      0.99      0.99      1000
         fr       0.98      0.98      0.98      1000
         hu       0.99      0.99      0.99      1000
         it       0.97      0.99      0.98      1000
         lt       0.98      0.98      0.98      1000
         lv       1.00      0.98      0.99      1000
         nl       0.98      0.97      0.98      1000
         pl       1.00      0.99      0.99      1000
         pt       0.98      0.98      0.98      1000
         ro       0.99      0.99      0.99   

In [8]:
print(classification_report(test["lang"], lr.predict(test_svd)))



             precision    recall  f1-score   support

         bg       1.00      1.00      1.00      1000
         cs       0.99      0.99      0.99      1000
         da       0.99      0.99      0.99      1000
         de       0.99      1.00      0.99      1000
         el       1.00      1.00      1.00      1000
         en       0.99      1.00      1.00      1000
         es       0.99      0.99      0.99      1000
         et       1.00      0.99      0.99      1000
         fi       1.00      1.00      1.00      1000
         fr       1.00      0.99      0.99      1000
         hu       1.00      1.00      1.00      1000
         it       0.99      1.00      0.99      1000
         lt       1.00      1.00      1.00      1000
         lv       1.00      1.00      1.00      1000
         nl       0.99      0.99      0.99      1000
         pl       1.00      1.00      1.00      1000
         pt       0.99      0.99      0.99      1000
         ro       1.00      1.00      1.00   