In this notebook we describe the code used to produce the baseline.

# Data and Libraries

In [1]:
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import numpy as np

#DATA_PATH = "/kaggle/input/defi-ia-insa-toulouse"
DATA_PATH = '../Data'

train_df = pd.read_json(DATA_PATH+"/train.json")
test_df = pd.read_json(DATA_PATH+"/test.json")
train_label = pd.read_csv(DATA_PATH+"/train_label.csv")

# Cleaning

The only cleaning step applied here is lowercasing the text, so that all words are lower case.
Hence `research` and `Research` are treated as the same word.

You might want to look at other cleaning steps, such as removing stopwords or stemming words.

In [2]:
train_df["description_lower"] = [x.lower() for x in train_df.description]
test_df["description_lower"] = [x.lower() for x in test_df.description]
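
As a minimal sketch of one of the extra cleaning steps mentioned above, the function below lowercases a description, keeps only alphanumeric tokens, and drops stopwords. The `STOPWORDS` set and the `clean` helper are hypothetical and deliberately tiny; they are not part of the baseline.

```python
import re

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "at", "she", "he"}

def clean(text):
    """Lowercase the text, keep alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = clean("She is a Research scientist at the University.")
print(cleaned)
```

In practice a fuller list is preferable; scikit-learn's `TfidfVectorizer` accepts `stop_words="english"` and already lowercases its input by default (`lowercase=True`).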

# Vectorization

We use `TfidfVectorizer` to transform the text into numerical vectors.

More vectorizers are available in scikit-learn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

You may also want to have a look at word embedding methods (Word2vec, GloVe, etc.).

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
transformer = TfidfVectorizer().fit(train_df["description_lower"].values)
print("NB features: %d" %(len(transformer.vocabulary_)))
X_train = transformer.transform(train_df["description_lower"].values)
X_test = transformer.transform(test_df["description_lower"].values)
X_train

NB features: 230368


<217197x230368 sparse matrix of type '<class 'numpy.float64'>'
	with 9851657 stored elements in Compressed Sparse Row format>
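
To make the weighting more concrete, here is a plain-Python sketch of the TF-IDF computation on a toy corpus. It assumes the formula scikit-learn documents for its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation), but tokenizes by whitespace only, which is a simplification of the real tokenizer. The `tfidf` helper and the toy documents are illustrative, not part of the baseline.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

toy_docs = ["she is a nurse", "he is a professor", "she is a research professor"]
toy_vectors = tfidf(toy_docs)
print(toy_vectors[0])
```

Terms that occur in every document (`is`, `a`) receive a lower weight than terms specific to one document (`nurse`), which is what makes the representation useful for classification.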

# Train/Test Split

In [22]:
from sklearn.model_selection import train_test_split
Y_train = train_label.Category.values

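# Hold out 20% of the labelled data for evaluation; random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(X_train, Y_train, test_size=0.2, random_state=0)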
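
The split above is purely random. If some categories are rare, a stratified split keeps the class proportions similar in both parts; `train_test_split` has a `stratify` parameter for this. The sketch below shows the idea in plain Python on toy labels. The `stratified_split` helper is hypothetical and only illustrates the principle.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with about test_size of each class held out."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one example of every class in the test part.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [0] * 50 + [1] * 10 + [2] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```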

In [19]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

# Learning

We use a simple Logistic Regression model with scikit-learn's default argument values to train the baseline model.

In [4]:
from sklearn.linear_model import LogisticRegression
Y_train = train_label.Category.values
model = LogisticRegression()
model.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [21]:
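from sklearn.metrics import f1_score

# Evaluate on the held-out split. `model` must be the one fitted on x_train only:
# a model fitted on the full X_train has already seen x_test and would inflate the score.
pred = model.predict(x_test)
score_f1 = f1_score(y_test, pred, average='macro')
print("Score f1:", score_f1)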

Score f1: 0.730753874438092
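
For reference, `average='macro'` means the F1 score is computed separately for each class and then averaged without weighting, so rare categories count as much as frequent ones. The plain-Python sketch below reproduces that definition on toy labels; the `macro_f1` helper and the toy data are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero because
        # every class considered appears in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

toy_true = [0, 0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 2]
toy_score = macro_f1(toy_true, toy_pred)
print(toy_score)
```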


# Prediction

In [5]:
predictions = model.predict(X_test)
predictions

array([ 6, 20, 24, ..., 25, 26, 15], dtype=int64)

In [10]:
pred2 = model.predict(X_train)
pred2

array([19,  3, 19, ..., 19, 19,  1], dtype=int64)

# File Generation

In [None]:
test_df["Category"] = predictions
baseline_file = test_df[["Id","Category"]]
baseline_file.to_csv("/kaggle/working/baseline.csv", index=False)