In this notebook we describe the code used to produce the baseline.

# Data and Libraries

In [1]:
import pandas as pd
import pickle

DATA_PATH = "/kaggle/input/defi-ia-insa-toulouse"
train_df = pd.read_json(DATA_PATH+"/train.json")
test_df = pd.read_json(DATA_PATH+"/test.json")
train_label = pd.read_csv(DATA_PATH+"/train_label.csv")

# Cleaning

The only cleaning transformation applied here is lowercasing the descriptions with `lower`, so that all words are lower case.
Hence `research` and `Research` are treated as the same word.

You might want to look at other cleaning steps, such as removing stopwords, stemming words, etc.
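
As a minimal sketch of one such extra cleaning step, the snippet below lowercases a text, splits it into tokens and drops stopwords. The stopword list `TOY_STOPWORDS`, the helper `clean_text` and the example sentence are hypothetical and only there for illustration; a real run would more likely rely on a complete stopword list from a library, and this step is not part of the baseline.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
TOY_STOPWORDS = {"a", "an", "and", "at", "in", "is", "of", "the", "to"}

def clean_text(text, stopwords=TOY_STOPWORDS):
    """Lowercase the text, keep runs of ASCII letters/digits, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(token for token in tokens if token not in stopwords)

# Made-up example sentence, not taken from the competition data.
cleaned = clean_text("She is a Research Scientist at the University of Toulouse.")
print(cleaned)
```

Assuming the same column names as in this notebook, such a function could be applied with `train_df["description"].map(clean_text)` in place of the plain lowercasing.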

In [2]:
train_df["description_lower"] = [x.lower() for x in train_df.description]
test_df["description_lower"] = [x.lower() for x in test_df.description]

# Vectorization

We use `TfidfVectorizer` to turn each description into a sparse numerical vector of TF-IDF weights, with one feature per word of the vocabulary.

More vectorizers are available in scikit-learn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

You may also want to have a look at word embedding methods (Word2vec, GloVe, etc.).

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
transformer = TfidfVectorizer().fit(train_df["description_lower"].values)
print("NB features: %d" %(len(transformer.vocabulary_)))
X_train = transformer.transform(train_df["description_lower"].values)
X_test = transformer.transform(test_df["description_lower"].values)
X_train

NB features: 230368


<217197x230368 sparse matrix of type '<class 'numpy.float64'>'
	with 9851657 stored elements in Compressed Sparse Row format>
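
To give an intuition of what the TF-IDF weights represent, here is a rough pure-Python sketch on a made-up corpus. It uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which corresponds to the default settings of `TfidfVectorizer`. The tokenization is simplified to whitespace splitting (an assumption: scikit-learn uses its own token pattern), and `tfidf_rows`, `toy_docs` and `toy_rows` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get a larger value.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_counts.items()}
        # L2 normalization so that every non-empty row has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

# Made-up corpus, not taken from the competition data.
toy_docs = ["research scientist", "research nurse", "nurse practitioner"]
toy_rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, toy_rows):
    print(doc, "->", {term: round(weight, 3) for term, weight in row.items()})
```

In the first toy document, `scientist` appears in a single document while `research` appears in two, so `scientist` receives the larger weight.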

# Learning

We train the baseline with a simple logistic regression model, using scikit-learn's default parameter values.

In [None]:
from sklearn.linear_model import LogisticRegression
Y_train = train_label.Category.values
model = LogisticRegression()
model.fit(X_train, Y_train)
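
For a quick sanity check of the same approach, the sketch below chains the vectorizer and the classifier with `make_pipeline` and fits them on a tiny made-up dataset. The texts, the labels and the names `toy_texts`, `toy_labels`, `toy_model` and `toy_pred` are hypothetical; they only illustrate how the two steps fit together and say nothing about the score on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up descriptions and labels (illustration only).
toy_texts = [
    "she teaches mathematics at the university",
    "he teaches history to high school students",
    "she treats patients at the hospital",
    "he treats patients in a private clinic",
]
toy_labels = ["teacher", "teacher", "doctor", "doctor"]

# The pipeline lowercases and vectorizes the text, then fits the classifier.
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_model.fit(toy_texts, toy_labels)

# Words unseen during fitting (e.g. "physics") are simply ignored.
toy_pred = toy_model.predict(["she teaches physics", "he treats children"])
print(toy_pred)
```

Keeping both steps in a single pipeline object means that new texts go through exactly the same vectorization as the training data before being classified.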

# Prediction

In [None]:
predictions = model.predict(X_test)
predictions

# File Generation

In [None]:
test_df["Category"] = predictions
baseline_file = test_df[["Id","Category"]]
baseline_file.to_csv("/kaggle/working/baseline.csv", index=False)