In this notebook we describe the code used to produce the baseline.

# Data and Libraries

In [5]:
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import numpy as np

#DATA_PATH = "/kaggle/input/defi-ia-insa-toulouse"
DATA_PATH = '../Data'

train_df = pd.read_json(DATA_PATH+"/train.json").set_index('Id')
test_df = pd.read_json(DATA_PATH+"/test.json").set_index('Id')
train_label = pd.read_csv(DATA_PATH+"/train_label.csv").set_index('Id')

In [6]:
train_df.head()

Id,description,gender
0,She is also a Ronald D. Asmus Policy Entrepre...,F
1,He is a member of the AICPA and WICPA. Brent ...,M
2,Dr. Aster has held teaching and research posi...,M
3,He runs a boutique design studio attending cl...,M
4,"He focuses on cloud security, identity and ac...",M


# Train/Test Split

In [7]:
from sklearn.model_selection import train_test_split
X = train_df
Y = train_label


X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=0)

In [8]:
# Add the label and its category name as columns of X_train:
X_train["label"]=Y_train.iloc[:,0]
category_df = pd.read_csv(DATA_PATH+"/categories_string.csv")
X_train["category_name"] = [category_df[category_df["1"]==x].values[0][0] for x in X_train["label"].values]
X_train.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Id,description,gender,label,category_name
198219,She is trained in cognitive behavior therapy ...,F,22,psychologist
1460,He spent over 8 years in PWC’s Audit and Advi...,M,9,accountant
126976,For a number of years Haig has been exploring...,M,19,professor
184223,"Prior to joining the firm, Michelle was the h...",F,26,attorney
144124,While his early works were greatly influenced...,M,25,composer
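
The `SettingWithCopyWarning` in the output above appears because `train_test_split` returns slices of `train_df`, and pandas cannot tell whether the new columns are also meant for the original frame. Below is a minimal sketch of one common way to avoid it, assuming the memory cost of an explicit copy is acceptable. The toy frame and the names `toy_df` and `part` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df.
toy_df = pd.DataFrame({"description": ["a", "b", "c", "d"]})

# Taking an explicit copy makes `part` an independent DataFrame,
# so adding a column to it no longer triggers SettingWithCopyWarning.
part = toy_df.iloc[:2].copy()
part["label"] = [1, 0]

print(part.columns.tolist())    # the copy has the new column
print(toy_df.columns.tolist())  # the original frame is unchanged
```

In this notebook the same idea would amount to calling `.copy()` on `X_train` and `X_test` right after the split.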


In [9]:
# Compare the class proportions of the two splits side by side.
d1 = Y_train.Category.value_counts(normalize=True).rename('Y_train')
d2 = Y_test.Category.value_counts(normalize=True).rename('Y_test')
d = pd.concat([d1,d2],axis=1,sort=False)
d

Category,Y_train,Y_test
0,0.006877,0.006952
1,0.019107,0.018301
2,0.004224,0.004834
3,0.042041,0.042357
4,0.003758,0.003545
5,0.021173,0.021685
6,0.056665,0.056377
7,0.004006,0.003729
8,0.030174,0.031607
9,0.014181,0.015124
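
The proportions above are close in both splits, which is expected with this many samples. If exact agreement matters, for example on a smaller dataset, `train_test_split` accepts a `stratify` argument. Here is a small sketch on made-up labels; the arrays `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the class proportions identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())  # share of class 1 in each split
```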


# Cleaning

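The cleaning function below lowercases the text, so that `research` and `Research` are treated as the same word, strips accents, and replaces every character other than `a-z` and `_` with a space.
It then removes English stopwords, plus a short hand-picked list of additional words, and stems the remaining tokens with the Snowball stemmer.

You might want to experiment with these steps, for example with a different stopword list or without stemming.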

In [10]:
import unicodedata 
import re
import nltk
import time

In [11]:
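# Build the stopword set and the stemmer once, not on every call.
# Requires the NLTK stopword corpus: nltk.download('stopwords')
english_stopwords = nltk.corpus.stopwords.words('english')
additional_stopwords = ["work","interest","year","currently","including","received","focus"]
# Strip accents from the stopwords so that they match the cleaned text.
stopwords = set(unicodedata.normalize('NFD', sw).encode('ascii', 'ignore').decode("utf-8") for sw in english_stopwords+additional_stopwords)
stemmer = nltk.stem.SnowballStemmer('english')

def clean_txt(txt):
    # Lowercase, strip accents, and replace anything but a-z and "_" with a space.
    txt = txt.lower()
    txt = unicodedata.normalize('NFD', txt).encode('ascii', 'ignore').decode("utf-8")
    txt = re.sub('[^a-z_]', ' ', txt)
    # Drop stopwords, then stem the remaining tokens.
    tokens = [w for w in txt.split() if (w not in stopwords)]
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens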
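
To make the first three steps of `clean_txt` concrete, here is a standalone sketch that applies only the lowercasing, accent stripping and character filtering to a made-up sentence. It leaves out the NLTK stopword and stemming steps so that it runs without any downloaded corpus. The function name `normalize_txt` is hypothetical.

```python
import re
import unicodedata

def normalize_txt(txt):
    # Lowercase so that "Research" and "research" map to the same token.
    txt = txt.lower()
    # NFD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors ignored then drops the marks.
    txt = unicodedata.normalize('NFD', txt).encode('ascii', 'ignore').decode('utf-8')
    # Replace everything except a-z and underscore with a space.
    txt = re.sub('[^a-z_]', ' ', txt)
    return txt.split()

print(normalize_txt("Dr. Müller's Research, 2019: café-based NLP!"))
```

Note that digits and punctuation are dropped entirely, so a token such as `2019` never reaches the vectorizer.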

In [12]:
%%time
X_train["description_cleaned"] = [" ".join(clean_txt(x)) for x in X_train["description"].values]
X_test["description_cleaned"] = [" ".join(clean_txt(x)) for x in X_test["description"].values]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Wall time: 3min 29s




# Vectorization

We use `TfidfVectorizer` to turn the cleaned text into numerical feature vectors.

Other vectorizers are available in scikit-learn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

You may also want to look at word embedding methods (Word2vec, GloVe, etc.).
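
As an illustration of what the vectorizer computes, the sketch below reproduces by hand the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. It works on three made-up documents and ignores tokenization details. The helper name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed idf, as used by TfidfVectorizer's default smooth_idf=True.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[w] * idf[w] for w in vocab]
        # L2-normalise so that every document vector has unit length.
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm for x in weights])
    return vocab, rows

# Made-up, already cleaned documents.
docs = ["professor research univers", "attorney law firm", "research law"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(x, 3) for x in rows[2]])
```

Terms that occur in fewer documents get a larger weight: in the first document, `professor` ends up with a higher value than `research`, which also appears in the third one.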

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
transformer = TfidfVectorizer().fit(X_train["description_cleaned"].values)
print("NB features: %d" %(len(transformer.vocabulary_)))
X_train_vect = transformer.transform(X_train["description_cleaned"].values)
X_test_vect = transformer.transform(X_test["description_cleaned"].values)
X_train_vect

NB features: 153106


<173757x153106 sparse matrix of type '<class 'numpy.float64'>'
	with 5383957 stored elements in Compressed Sparse Row format>

# Learning

We train the baseline with a simple scikit-learn Logistic Regression model, keeping the default arguments except for `max_iter=500` and `n_jobs=-1`.

In [None]:
%%time
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=500,n_jobs=-1)
model.fit(X_train_vect, Y_train.Category.values)

# Prediction

In [None]:
predictions = model.predict(X_test_vect)
predictions

In [None]:
pred2 = model.predict(X_train_vect)
pred2

In [None]:
from sklearn.metrics import f1_score
score_f1 = f1_score(Y_test.Category.values,predictions,average='macro')
print("Score f1:",score_f1)
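
With `average='macro'`, `f1_score` computes one F1 score per class and takes their unweighted mean, so a rare category weighs as much as a frequent one. Given the class imbalance visible in the proportions table of the train/test split, this choice matters. The sketch below is a hand-rolled version on made-up labels; the function name `macro_f1` is hypothetical.

```python
def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        # Per-class counts, treating class c as the positive class.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never 0 here
        # because every label occurs in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    # Unweighted mean over classes.
    return sum(scores) / len(scores)

# Made-up labels: class 1 is rare and one of its two samples is missed.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
print(macro_f1(y_true, y_pred))
```

On this toy example the accuracy is 5/6, while the macro F1 is 7/9: the error on the rare class pulls the macro score down more than it pulls the accuracy.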

# File Generation

In [None]:
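# Predict on the competition test set, not on the validation split used above.
test_df["description_cleaned"] = [" ".join(clean_txt(x)) for x in test_df["description"].values]
test_df["Category"] = model.predict(transformer.transform(test_df["description_cleaned"].values))
# "Id" is the index of test_df, so restore it as a column before selecting it.
baseline_file = test_df.reset_index()[["Id","Category"]]
baseline_file.to_csv("/kaggle/working/baseline.csv", index=False)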