
The Gurugram-based company ‘FlipItNews’ aims to revolutionize the way Indians perceive finance, business, and capital market investment, by giving it a boost through artificial intelligence (AI) and machine learning (ML). They’re on a mission to reinvent financial literacy for Indians, where financial awareness is driven by smart information discovery and engagement with peers. Through their smart content discovery and contextual engagement, the company is simplifying business, finance, and investment for millennials and first-time investors.

Objective:
The goal of this project is to take a set of news articles extracted from the company’s internal database and categorize them into categories such as politics, technology, sports, business, and entertainment based on their content. Use natural language processing to create and compare at least three different models.

Attribute Information:
Article
Category
The feature names are self-explanatory.

Concepts Tested:

Natural Language Processing

Text Processing

Stopwords, Tokenization, Lemmatization

Bag of Words, TF-IDF

Multi-class Classification

What does ‘good’ look like?

Installing & Importing all the required libraries and Loading the dataset.

Conduct a preliminary analysis to understand the structure of the dataset and the distribution of news articles in each category.

Create a user defined function to process the textual data (news articles).

Remove non-letters

Remove Stopwords

Word Tokenize the text

Perform Lemmatization

Display what a single news article looks like before and after the processing.

Encode the target variable (category) using Label/Ordinal encoder.

Create an option for the user to choose between Bag of Words and TF-IDF techniques for vectorizing the data.

Perform train-test split and train a Naive Bayes classifier model using the simple/classical approach.

Evaluate the model’s performance and plot the Confusion Matrix as well as Classification Report.

Functionalize the code and train & evaluate three more classifier models (Decision Tree, Nearest Neighbors, Random Forest).

Observe and comment on the performances of all the models used.

Evaluation Criteria (100 points)

1. Importing the libraries & Reading the data file (10 points)

2. Exploring the dataset (10 points)

Shape of the dataset

News articles per category

3. Processing the Textual Data i.e. the news articles (30 points)

Removing the non-letters

Tokenizing the text

Removing stopwords

Lemmatization

4. Encoding and Transforming the data (20 points)

Encoding the target variable

Bag of Words

TF-IDF

Train-Test Split

5. Model Training & Evaluation (30 points)

Simple Approach

Naive Bayes

Functionalized Code (Optional)

Decision Tree

Nearest Neighbors

Random Forest

Questionnaire:

How many news articles are present in the dataset that we have?

Most of the news articles are from _____ category.

Only ___ no. of articles belong to the ‘Technology’ category.

What are Stop Words and why should they be removed from the text data?

Explain the difference between Stemming and Lemmatization.

Which of the techniques Bag of Words or TF-IDF is considered to be more efficient than the other?

What’s the shape of the train & test data sets after performing a 75:25 split?

Which of the following is found to be the best performing model?

a. Random Forest b. Nearest Neighbors c. Naive Bayes

According to this particular use case, both precision and recall are equally important. (T/F)



In [None]:
# The Gmail attachment link originally passed to gdown here is not a downloadable file URL.
# Upload flipitnews-data.csv to /content/sample_data/ (the path read below) instead.

In [None]:
!pip install --user -U nltk
import nltk

In [None]:
!pip install category_encoders

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer # For lemmatization
nltk.download('wordnet')
nltk.download('stopwords') # Required by stopwords.words('english') in preprocess_text

import re
import string
import category_encoders as ce

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer #For Bow and TF-IDF
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
#Performance Metrics for evaluating the model
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score,precision_score, recall_score
from sklearn.metrics import confusion_matrix, classification_report

import warnings
warnings.simplefilter('ignore')

In [None]:
df=pd.read_csv('/content/sample_data/flipitnews-data.csv')
df.head()

In [None]:
df.info()  # info() prints its summary directly and returns None, so display() is not needed

In summary, this output tells us that the DataFrame has 2225 rows and 2 columns, with no missing values. Both columns contain object type data.

In [None]:
display(df.isnull().sum())

In [None]:
df.Category.value_counts()

In [None]:
print("no of rows:", df.shape[0])

In [None]:

plt.figure(figsize=(8,5))
ax=sns.countplot(x='Category',data=df, palette='Reds')
ax.bar_label(ax.containers[0])

The bar plot shows the count of news articles in each category:

*   **Sports and Business:** have the highest number of articles.
*   **Politics:** follows after Sports and Business.
*   **Technology:** has the lowest number of articles.
*   **Entertainment:** sits in the middle, with a moderate number of articles.

The plot shows that the dataset is imbalanced, with some categories having more articles than others; the exact counts are in the `value_counts()` output above. This is important to consider when building and evaluating classification models.

In [None]:
df['Article'][1]

The 'Article' column contains the full text of the news articles. Each entry is a string holding the content of one article. This raw text is what the classification models are trained on to predict the category of each article.

In [None]:
df.sample(10)

In [None]:
# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Replace non-letter characters with spaces and convert to lowercase
    text = re.sub('[^a-zA-Z]', ' ', text).lower()

    # Tokenize the text on whitespace
    tokens = text.split()

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Perform lemmatization
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    # Join the processed words back into a string
    return ' '.join(tokens)

With `preprocess_text` defined, the next step is to apply it to the 'Article' column of the dataframe and display one article before and after processing.

In [None]:
# Display how a single news article looks like before processing
print("Original article:\n", df['Article'][0])

# Apply the preprocessing function to the 'Article' column
df['Processed_Article'] = df['Article'].apply(preprocess_text)

# Display how the same news article looks like after processing
print("\nProcessed article:\n", df['Processed_Article'][0])


Vectorize the processed text using TF-IDF to create the feature matrix `X` for SMOTE.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['Processed_Article']).toarray()
y = df['Category'].values
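
The brief asks for the target to be label-encoded, whereas `y` above keeps the raw category strings (which scikit-learn classifiers also accept). A minimal sketch of the encoding step with `LabelEncoder` follows; `sample_categories` is a made-up list standing in for `df['Category']`, not the real column.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for df['Category'] (assumption for illustration only).
sample_categories = ["Sports", "Business", "Politics", "Technology", "Entertainment", "Sports"]

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(sample_categories)

# classes_ holds the original names in sorted order; the position is the integer code.
for code, name in enumerate(label_encoder.classes_):
    print(code, name)

# inverse_transform maps integer codes (e.g. model predictions) back to category names.
decoded = label_encoder.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around matters because it is the only record of which integer corresponds to which category.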

In [None]:
import pandas as pd

# Convert the numpy array y to a pandas Series to use value_counts()
y_series = pd.Series(y)

# Display the class distribution
display(y_series.value_counts())

The class distribution of `y` above is the baseline before oversampling. SMOTE is applied next to balance the classes.

In [None]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the vectorized data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("Shape of original data (X):", X.shape)
print("Shape of original target (y):", y.shape)
print("Shape of resampled data (X_resampled):", X_resampled.shape)
print("Shape of resampled target (y_resampled):", y_resampled.shape)

SMOTE has been applied, and the shapes of the original and resampled data are shown above. The next cell displays the class distribution of `y_resampled` to show how the classes are distributed after oversampling.

In [None]:
# Convert the resampled numpy array y_resampled to a pandas Series
y_resampled_series = pd.Series(y_resampled)

# Display the class distribution of the resampled data
display(y_resampled_series.value_counts())

The dataset is now balanced: the output above shows a uniform distribution of articles across categories.

The next step is to train and evaluate the classification models on this balanced data, using `X_resampled` and `y_resampled`.

In [None]:
from sklearn.model_selection import train_test_split

X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_resampled, y_resampled, test_size=0.25, random_state=42, stratify=y_resampled)

print("Shape of X_train_resampled:", X_train_resampled.shape)
print("Shape of X_test_resampled:", X_test_resampled.shape)
print("Shape of y_train_resampled:", y_train_resampled.shape)
print("Shape of y_test_resampled:", y_test_resampled.shape)

The resampled data is now split into training and test sets (`X_train_resampled`, `y_train_resampled`, `X_test_resampled`, `y_test_resampled`), ready for training and evaluating the classifiers.
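
One point on ordering: when oversampling runs before the split, synthetic rows derived from articles that later land in the test set can end up in the training set, which tends to inflate test scores. Below is a minimal sketch of the alternative ordering (split first, then balance only the training portion). It uses made-up data and plain random oversampling from scikit-learn as a stand-in for SMOTE, so that the example needs no extra dependency; `oversample_to_majority` is a hypothetical helper, not part of any library.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data: 40 rows of class "a", 20 rows of class "b".
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 4)
y_demo = np.array(["a"] * 40 + ["b"] * 20)

# 1) Split first, so the test rows never influence the oversampling step.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

def oversample_to_majority(X, y, random_state=42):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls in classes:
        X_cls = X[y == cls]
        extra = target - len(X_cls)
        if extra > 0:
            # Draw the missing rows with replacement from the existing rows of this class.
            X_extra = resample(X_cls, replace=True, n_samples=extra, random_state=random_state)
            X_cls = np.vstack([X_cls, X_extra])
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), cls))
    return np.vstack(X_parts), np.concatenate(y_parts)

# 2) Balance the training split only; the test split keeps its original distribution.
X_tr_bal, y_tr_bal = oversample_to_majority(X_tr, y_tr)

print("train before:", dict(zip(*np.unique(y_tr, return_counts=True))))
print("train after: ", dict(zip(*np.unique(y_tr_bal, return_counts=True))))
print("test:        ", dict(zip(*np.unique(y_te, return_counts=True))))
```

With imbalanced-learn, the same ordering would mean calling `fit_resample` on the training split only.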

The next step is to train the suggested classification models (Naive Bayes, Decision Tree, Nearest Neighbors, and Random Forest) on the split, resampled data (`X_train_resampled`, `y_train_resampled`, `X_test_resampled`, `y_test_resampled`):

*   **Naive Bayes:** train a Multinomial Naive Bayes classifier on the resampled training data, then evaluate it on the resampled test data with appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix).
*   **Decision Tree:** train a Decision Tree classifier and evaluate it in the same way.
*   **Nearest Neighbors:** train a K-Nearest Neighbors classifier and evaluate it in the same way.
*   **Random Forest:** train a Random Forest classifier and evaluate it in the same way.
*   **Comparison:** summarize and compare the evaluation metrics of all trained models to determine the best-performing model.

The first model is the Multinomial Naive Bayes classifier, trained on the resampled training data.

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Instantiate a MultinomialNB model
nb_model = MultinomialNB()

# Train the Naive Bayes model using the resampled training data
nb_model.fit(X_train_resampled, y_train_resampled)

The Naive Bayes model has been trained on the resampled data. The next step is to evaluate its performance on the resampled test data.

In [None]:
# Make predictions on the resampled test data
y_pred_nb_resampled = nb_model.predict(X_test_resampled)

# Calculate evaluation metrics
accuracy_nb_resampled = accuracy_score(y_test_resampled, y_pred_nb_resampled)
precision_nb_resampled = precision_score(y_test_resampled, y_pred_nb_resampled, average='weighted')
recall_nb_resampled = recall_score(y_test_resampled, y_pred_nb_resampled, average='weighted')
f1_nb_resampled = f1_score(y_test_resampled, y_pred_nb_resampled, average='weighted')

# Print the evaluation metrics
print("Naive Bayes Model Performance on Resampled Test Data:")
print(f"Accuracy: {accuracy_nb_resampled:.4f}")
print(f"Precision (weighted): {precision_nb_resampled:.4f}")
print(f"Recall (weighted): {recall_nb_resampled:.4f}")
print(f"F1-score (weighted): {f1_nb_resampled:.4f}")

# Generate the confusion matrix
cm_nb_resampled = confusion_matrix(y_test_resampled, y_pred_nb_resampled)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_nb_resampled, annot=True, fmt='d', cmap='Blues', xticklabels=nb_model.classes_, yticklabels=nb_model.classes_)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for Naive Bayes Model (Resampled Data)')
plt.show()

The Decision Tree and Random Forest models below are trained through reusable helper functions on a separate TF-IDF split built from the raw articles (not the SMOTE-resampled data). The vectorizer is fitted on the training split only and then applied to the validation split.

In [None]:
# Split the raw articles first, stratifying on the category labels
y=np.array(df['Category'].values)
X_train, X_val, y_train, y_val=train_test_split(df['Article'].values, y, test_size=0.25,shuffle=True,stratify=y)
# Fit TF-IDF on the training split only, then reuse the fitted vocabulary on the validation split
tf_idf=TfidfVectorizer()
X_train=tf_idf.fit_transform(X_train).toarray()
X_val=tf_idf.transform(X_val).toarray()

In [None]:
def model_train(obj):
  obj.fit(X_train, y_train)
  y_pred=obj.predict(X_val)
  y_pred_proba=obj.predict_proba(X_val)
  return y_pred, y_pred_proba

def model_eval(obj,y_pred,y_pred_proba):
  print('-------------------------')
  train_acc=accuracy_score(y_train, obj.predict(X_train))
  test_acc=accuracy_score(y_val, obj.predict(X_val))

  print('Train Accuracy:{:.3f}'.format(train_acc))
  print('Test Accuracy:{:.3f}\n'.format(test_acc))
  print('ROC AUC Score:{:.3f}\n'.format(roc_auc_score(y_val, y_pred_proba, multi_class='ovr')))
  precision=precision_score(y_val, y_pred, average='weighted')
  recall=recall_score(y_val, y_pred, average='weighted')
  f1=f1_score(y_val, y_pred, average='weighted')
  print('Precision:{:.3f}'.format(precision))
  print('Recall:{:.3f}'.format(recall))
  print('F1 score:{:.3f}'.format(f1))
  print('-------------------------')

In [None]:
dt=DecisionTreeClassifier()
y_pred_dt,y_pred_proba_dt=model_train(dt)
model_eval(dt, y_pred_dt,y_pred_proba_dt)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Define the parameter grid to tune
param_grid = {'max_depth': [None, 10, 20, 30, 40, 50]} # You can adjust the range of depths

# Instantiate the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)

# Instantiate GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='f1_weighted') # Using weighted F1-score as the metric

# Fit GridSearchCV to the resampled training data
grid_search.fit(X_train_resampled, y_train_resampled)

# Print the best hyperparameters and the corresponding score
print("Best hyperparameters:", grid_search.best_params_)
print("Best F1-score (weighted):", grid_search.best_score_)

# Get the best model
best_dt_model = grid_search.best_estimator_

In [None]:
# Instantiate a DecisionTreeClassifier model with the best hyperparameters
dt_model = DecisionTreeClassifier(**grid_search.best_params_, random_state=42)

# Train the Decision Tree model using the resampled training data
dt_model.fit(X_train_resampled, y_train_resampled)

In [None]:

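# Note: model_train refits dt_model on the non-resampled TF-IDF split (X_train, y_train) before predicting on X_val
y_pred_dt,y_pred_proba_dt=model_train(dt_model)
model_eval(dt_model, y_pred_dt,y_pred_proba_dt)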

In [None]:
rf=RandomForestClassifier()
y_pred_rf,y_pred_proba_rf=model_train(rf)
# Pass rf (not dt) so the train/test accuracies reported are those of the Random Forest
model_eval(rf, y_pred_rf,y_pred_proba_rf)
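
To compare models side by side rather than reading separate printouts, the metrics can be collected into one table. The sketch below is an illustration on synthetic data from `make_classification` (an assumption standing in for the vectorized articles); the model set and metric names mirror the ones used in this notebook, but the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class data standing in for the TF-IDF / BoW matrix.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=10, n_classes=5, random_state=0
)
X_demo = np.abs(X_demo)  # MultinomialNB needs non-negative features, as TF-IDF/BoW provide

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

rows = []
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    rows.append({
        "model": name,
        "train_acc": accuracy_score(y_tr, clf.predict(X_tr)),
        "test_acc": accuracy_score(y_te, y_hat),
        "f1_weighted": f1_score(y_te, y_hat, average="weighted"),
    })

# One row per model, best weighted F1 first.
comparison = pd.DataFrame(rows).set_index("model").sort_values("f1_weighted", ascending=False)
print(comparison.round(3))
```

A large gap between `train_acc` and `test_acc` in such a table is a quick signal of overfitting, which is common for unpruned decision trees.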

### We tried TF-IDF; now Bag of Words (BoW)

In [None]:
cv=CountVectorizer(max_features=2500)
X_train, X_val, y_train, y_val=train_test_split(df['Article'].values, df['Category'].values, test_size=0.25,shuffle=True,stratify=y)

In [None]:
X_train=cv.fit_transform(X_train).toarray()
X_val=cv.transform(X_val).toarray()
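
The brief asks for a way to let the user choose between Bag of Words and TF-IDF instead of editing cells by hand. One possible shape for that option is sketched below; `make_vectorizer` and its `method` argument are hypothetical names introduced here, and the three sentences are made-up stand-ins for articles.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def make_vectorizer(method="tfidf", max_features=None):
    """Return an unfitted vectorizer for the requested technique ('bow' or 'tfidf')."""
    key = method.lower().replace("-", "").replace("_", "")
    if key == "bow":
        return CountVectorizer(max_features=max_features)
    if key == "tfidf":
        return TfidfVectorizer(max_features=max_features)
    raise ValueError(f"Unknown vectorization method: {method!r} (expected 'bow' or 'tfidf')")

# Made-up documents standing in for the articles.
docs = [
    "the market rallied as shares rose",
    "the team won the match",
    "shares fell after the match was cancelled",
]

bow_matrix = make_vectorizer("bow").fit_transform(docs)
tfidf_matrix = make_vectorizer("tf-idf").fit_transform(docs)

print("BoW shape:", bow_matrix.shape, "dtype:", bow_matrix.dtype)
print("TF-IDF shape:", tfidf_matrix.shape, "dtype:", tfidf_matrix.dtype)
```

Both vectorizers build the same vocabulary from the same text, so the matrices share a shape; they differ in the values, raw counts for BoW and length-normalized, IDF-weighted scores for TF-IDF.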

In [None]:
nb=MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb=nb.predict(X_val)
y_pred_proba_nb=nb.predict_proba(X_val)
model_eval(nb,y_pred_nb,y_pred_proba_nb)
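
A single 75:25 split gives one estimate of performance. As a sketch of a more robust check, the vectorizer and classifier can be wrapped in a scikit-learn `Pipeline` and scored with stratified k-fold cross-validation, so that the vocabulary is refitted on the training folds each time. The tiny corpus and its labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up two-class corpus, six documents per class.
texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "the club signed a new goalkeeper",
    "fans cheered as the team lifted the trophy",
    "the match ended in a draw",
    "shares rose as the company reported profit",
    "the bank raised interest rates",
    "the company profit beat market forecasts",
    "investors sold shares after the earnings report",
    "the market fell on weak economic data",
    "the firm announced a merger with a rival company",
]
labels = ["Sports"] * 6 + ["Business"] * 6

# The pipeline refits TF-IDF inside every fold, so validation text never shapes the vocabulary.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_weighted")

print("Per-fold weighted F1:", scores.round(3))
print("Mean weighted F1:", scores.mean().round(3))
```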

### Decision Trees

In [None]:
dt=DecisionTreeClassifier(max_depth=18)
y_pred_dt,y_pred_proba_dt=model_train(dt)
model_eval(dt, y_pred_dt,y_pred_proba_dt)

### Random Forest

In [None]:
rf=RandomForestClassifier()
y_pred_rf,y_pred_proba_rf=model_train(rf)
# Pass rf (not dt) so the train/test accuracies reported are those of the Random Forest
model_eval(rf, y_pred_rf,y_pred_proba_rf)
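
The brief also lists a Nearest Neighbors classifier, which is imported in this notebook but not trained. A minimal, self-contained sketch on a made-up corpus is shown below; the choice of `n_neighbors=3` and cosine distance is an assumption for illustration, not a tuned setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up training documents, three per class.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal in the match",
    "the coach praised the team after the game",
    "shares rose as the company reported profit",
    "the bank raised interest rates",
    "the company profit beat market forecasts",
]
train_labels = ["Sports", "Sports", "Sports", "Business", "Business", "Business"]

# Cosine distance is a common choice for sparse TF-IDF vectors (assumed here, not tuned).
knn_pipeline = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
knn_pipeline.fit(train_texts, train_labels)

new_texts = ["the team scored a goal", "the company profit rose"]
predictions = knn_pipeline.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(f"{text!r} -> {label}")
```

Inside the notebook itself, the same classifier could be passed to the existing helpers, for example `model_train(KNeighborsClassifier())` followed by `model_eval`, since `KNeighborsClassifier` provides `predict_proba`.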

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense,Embedding,GRU,SimpleRNN
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

### TensorFlow

In [None]:
###RNN Based Approach



In [None]:
df=pd.read_csv('/content/sample_data/flipitnews-data.csv')
df.head()

In [None]:
#https://nlp.stanford.edu/projects/glove/

In [None]:
max_features=5000
maxlen=100
embedding_size=100
batch_size=512
epochs=10

#### LSTM

In [None]:

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

In [None]:
def preprocess_text(df, text_column):
  # Lowercase the text column (note: this redefines the earlier preprocess_text helper)
  df[text_column]=df[text_column].str.lower()
  return df

df=preprocess_text(df,'Article')

In [None]:
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(df['Article'])
sequences=tokenizer.texts_to_sequences(df['Article'])
data=pad_sequences(sequences, maxlen=maxlen)

In [None]:
len(tokenizer.word_index)

In [None]:
list(tokenizer.word_index)[:10]

In [None]:
le=LabelEncoder()
labels=le.fit_transform(df['Category'])
labels=tf.keras.utils.to_categorical(labels)

In [None]:
X_train,X_test,y_train, y_test=train_test_split(data,labels,test_size=0.2,random_state=42)

In [None]:
def load_glove_matrix(path, tok, max_feats, dim):
  print("Loading GloVe vectors...(this may take a minute)")
  embeddings_index = {}
  with open(path, "r", encoding="utf-8") as f:
    for line in f:
      values = line.split()
      word = values[0]
      coefs = np.asarray(values[1:], dtype='float32')
      embeddings_index[word] = coefs

  word_index = tok.word_index
  vocab_size = min(max_feats, len(word_index) + 1)
  matrix = np.zeros((vocab_size, dim), dtype="float32")
  for word, i in word_index.items():
    if i >= vocab_size:
      continue
    vec = embeddings_index.get(word)
    if vec is not None:
      matrix[i] = vec
  print(f"Embedding matrix shape: {matrix.shape}")
  return matrix

# Define the path to your GloVe file
glove_path = '/content/glove.6B.100d.txt' # Using the 100d version as specified earlier

# Call the function to load the GloVe matrix
embedding_matrix = load_glove_matrix(glove_path, tokenizer, max_features, embedding_size)

In [None]:
model=Sequential([
    Embedding(max_features,embedding_size,weights=[embedding_matrix],
              input_length=maxlen,trainable=False),
    LSTM(100),
    Dense(labels.shape[1], activation="softmax")

])

model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stoppings=EarlyStopping(monitor='val_loss', patience=5)
model.fit(X_train, y_train, batch_size=batch_size,epochs=epochs,validation_data=(X_test,y_test),
          verbose=2, callbacks=[early_stoppings])

In [None]:
def predict_category(text,tokenizer,model,label_encoder,max_len):
  text=text.lower()
  seq=tokenizer.texts_to_sequences([text])
  padded_seq=pad_sequences(seq, maxlen=max_len)
  pred=model.predict(padded_seq)
  print("output of pred:", pred)
  pred_labels_index=np.argmax(pred,axis=1)
  pred_labels=label_encoder.inverse_transform(pred_labels_index)
  return pred_labels[0]

In [None]:
input_text='I love  playing football in big field'
#input_text='I need to make dinosaurs for biofuel'
# Store the result under a new name so the predict_category function is not overwritten
predicted_category_result = predict_category(input_text, tokenizer, model, le, maxlen)
print("predicted_category:", predicted_category_result)

In [None]:
input_text='I need to create better algorithms for predicting the stock market'
predicted_category_result = predict_category(input_text, tokenizer, model, le, maxlen)
print("predicted_category:", predicted_category_result)

In [None]:
model=Sequential([
    Embedding(max_features,embedding_size,weights=[embedding_matrix],
              input_length=maxlen,trainable=False),
    SimpleRNN(100),
    Dense(labels.shape[1], activation="softmax")

])


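model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

# Train the SimpleRNN with the same settings as the LSTM above; otherwise the predictions below come from untrained weights
model.fit(X_train, y_train, batch_size=batch_size,epochs=epochs,validation_data=(X_test,y_test),
          verbose=2, callbacks=[early_stoppings])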

In [None]:
def predict_category(text,tokenizer,model,label_encoder,max_len):
  text=text.lower()
  seq=tokenizer.texts_to_sequences([text])
  padded_seq=pad_sequences(seq, maxlen=max_len)
  pred=model.predict(padded_seq)
  print("output of pred:", pred)
  pred_labels_index=np.argmax(pred,axis=1)
  pred_labels=label_encoder.inverse_transform(pred_labels_index)
  return pred_labels[0]

In [None]:
#input_text='I need to create better algorithms for predicting the stock market'
#input_text='I love seaguls'
input_text='Apple stocks beat google'
predicted_category_result = predict_category(input_text, tokenizer, model, le, maxlen)
print("predicted_category:", predicted_category_result)
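
Single-sentence predictions give only a rough impression of the network. To see which categories it confuses, the softmax outputs and one-hot labels can be converted back to class indices and passed to the same scikit-learn metrics used for the classical models. In the sketch below, `probs` and `y_true_onehot` are made-up arrays standing in for `model.predict(X_test)` and `y_test`, and `class_names` stands in for `le.classes_`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in for le.classes_ (assumed category names, in encoder order).
class_names = ["Business", "Entertainment", "Politics", "Sports", "Technology"]
label_ids = list(range(len(class_names)))

# Stand-in for y_test: one-hot rows for six articles.
y_true_onehot = np.eye(len(class_names))[[0, 1, 2, 3, 4, 3]]

# Stand-in for model.predict(X_test): one row of class probabilities per article.
probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.50, 0.10, 0.30, 0.05, 0.05],  # a Politics article predicted as Business
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.10, 0.60],
    [0.10, 0.10, 0.10, 0.60, 0.10],
])

# argmax turns one-hot labels and probability rows back into class indices.
y_true = y_true_onehot.argmax(axis=1)
y_pred = probs.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred, labels=label_ids)
print(cm)
print(classification_report(y_true, y_pred, labels=label_ids,
                            target_names=class_names, zero_division=0))
```

The resulting matrix can be passed to `sns.heatmap` in the same way as the Naive Bayes confusion matrix earlier in the notebook.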

## Project Insights and Recommendations

Based on the analysis and modeling performed:

*   **Data Exploration:** The initial exploration showed the distribution of categories in the dataset. This is a good starting point, but further analysis of the text content itself (e.g., word clouds, common phrases per category) could provide deeper insights into the characteristics of each category.
*   **Traditional ML Models (TF-IDF and BOW with Naive Bayes, Decision Tree, Random Forest):**
    *   You've implemented and evaluated Naive Bayes, Decision Tree, and Random Forest; the Nearest Neighbors classifier listed in the brief has not been trained yet.
    *   The performance metrics (accuracy, precision, recall, F1-score, ROC AUC) provide a good overview of how well these models perform on the validation set.
    *   Comparing the performance across different vectorization techniques (TF-IDF and BOW) and models is crucial. Random Forest with TF-IDF appeared to perform reasonably well; confirm this against the metrics printed by `model_eval`.
    *   GridSearchCV was used to tune the Decision Tree's `max_depth`. Consider extending hyperparameter tuning (GridSearchCV or RandomizedSearchCV) to the other models to potentially improve performance further.
*   **Deep Learning Models (LSTM and SimpleRNN with GloVe Embeddings):**
    *   You've set up LSTM and SimpleRNN models with pre-trained GloVe embeddings. This is a good approach for capturing semantic relationships in the text.
    *   The `model_eval` function is useful for consistent evaluation of the traditional models; the deep learning models are not yet evaluated with the same metrics, which makes a direct comparison harder.
    *   The prediction function `predict_category` is a good way to test the models on new text.
    *   **Recommendations for Deep Learning:**
        *   **Hyperparameter Tuning:** Experiment with different LSTM/RNN units, dropout rates, and optimizers to see if performance can be improved.
        *   **Epochs and Early Stopping:** You've used Early Stopping, which is good to prevent overfitting. However, observe the training and validation loss curves to understand if the models are converging and if more epochs might be beneficial (while still avoiding overfitting).
        *   **Different Embeddings:** While GloVe is a good choice, consider experimenting with other pre-trained embeddings like Word2Vec or FastText, or even training your own embeddings on your specific dataset if it's large enough.
        *   **Model Architectures:** Explore more complex architectures like Bidirectional LSTMs or GRUs, which can sometimes capture context more effectively.
        *   **Handling Out-of-Vocabulary Words:** With pre-trained embeddings, words not in the vocabulary are represented by zeros. Consider techniques to handle these, such as using a small random vector or exploring FastText embeddings which handle sub-word information.
*   **Evaluation:** You've used several relevant metrics. Consider visualizing the confusion matrices for the deep learning models as well to understand where the models are making mistakes (which categories are being confused with others).
*   **Further Improvements:**
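    *   **Text Preprocessing:** The traditional pipeline already removes non-letters and stop words and applies lemmatization, whereas the deep learning pipeline only lowercases the text. Consider applying comparable preprocessing there, or trying stemming as an alternative (though for some tasks, stop words can be important).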
    *   **Cross-Validation:** For more robust evaluation, consider using k-fold cross-validation, especially if the dataset is not very large.
    *   **Ensemble Methods:** Combining the predictions of different models (e.g., averaging probabilities or voting) can sometimes lead to improved performance.

Overall, you have a solid foundation for this text classification task. By exploring the recommendations above, you can further enhance the performance and robustness of your models.
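
The out-of-vocabulary recommendation above can be sketched with NumPy alone. The function below follows the same loop as `load_glove_matrix` but gives words missing from the pre-trained vectors a small random vector instead of zeros. `build_embedding_matrix`, the toy vectors, and the `scale=0.1` choice are all assumptions made for illustration.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, vocab_size, dim, seed=0):
    """Fill an embedding matrix; words without a pre-trained vector get a small random one."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((vocab_size, dim), dtype="float32")
    oov_words = []
    for word, i in word_index.items():
        if i >= vocab_size:
            continue  # outside the kept vocabulary, same rule as load_glove_matrix
        vec = vectors.get(word)
        if vec is None:
            # Out-of-vocabulary: small random vector instead of an all-zero row.
            vec = rng.normal(scale=0.1, size=dim)
            oov_words.append(word)
        matrix[i] = vec
    return matrix, oov_words

# Toy stand-ins for the GloVe dictionary and tokenizer.word_index.
toy_vectors = {"market": np.ones(4), "football": np.full(4, 2.0)}
toy_word_index = {"market": 1, "football": 2, "flipitnews": 3}

matrix, oov_words = build_embedding_matrix(toy_word_index, toy_vectors, vocab_size=4, dim=4)
print("OOV words:", oov_words)
print(matrix)
```

Row 0 stays zero because Keras tokenizers reserve index 0 for padding. Random rows only become informative if the embedding layer is later allowed to train, so this would likely be paired with `trainable=True` or a fine-tuning phase.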