# 🏆 Fake News Classification Mini-Hackathon 🏆  

Welcome to this exciting **Machine Learning Society Mini-Hackathon**, a collaboration between **Manu's Machine Learning Lectures** and the **Kaggle Team**!  

📰 **Your Mission:** Build a model that can distinguish between **Fake News (1)** and **Real News (0)**.  
🚀 **Why Participate?** This is your chance to test your machine learning skills, experiment with NLP techniques, and compete against other talented individuals!  

💡 **Key Challenge:** Fake news detection is a crucial task in today's world. Can your model accurately classify news articles based on their content?  

Let’s get started! Good luck and happy coding! 🎯


In [14]:
# Let's import some basic libraries

import kagglehub
import os
import pandas as pd

In [15]:
# Let's download the dataset!

path = kagglehub.dataset_download("emineyetm/fake-news-detection-datasets")
dataset_path = path
# print("Path to dataset files:", path)
# print("Dataset files:", os.listdir(dataset_path))

# 📊 Data Exploration & Train-Test Split  

Before jumping into model training, let’s take a look at the dataset and the rules for this competition.  

### **Dataset Overview**  
This dataset consists of **news articles**, each containing the following features:  
- **Title:** The headline of the article.  
- **Text:** The main content of the article.  
- **Subject:** The category of the article (e.g., politics, world news, etc.).  
- **Date:** The publication date of the article.  
- **Target:** The label indicating whether the news is **Fake (1)** or **Real (0)**. This column is not in the raw CSV files; it is added when the two files are combined.  

### **Train-Test Split (75%-25%)**  
For this hackathon, we have **predefined** the dataset split:  
- **Training Set (75%)** – You **must only train** your models on this portion.  
- **Test Set (25%)** – This is used to evaluate the final model performance.  

⚠️ **Important:** To ensure a fair competition, everyone should follow this split and avoid using test data for training.  

Explore the dataset, check for missing values, understand the distributions, and let’s get ready to build some awesome models! 🚀  
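
The split rule above can be sketched in a few lines of standard-library Python. This is only an illustration on row indices, not the notebook's actual split (which shuffles the DataFrame with pandas); `split_indices`, `train_fraction`, and `seed` are names invented for this example. The row count of 44898 is the size of the combined dataset shown in this notebook.

```python
import random

def split_indices(n_rows, train_fraction=0.75, seed=1):
    """Shuffle row indices reproducibly and cut them into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed -> same order on every run
    cut = int(train_fraction * n_rows)     # same rounding rule as int(0.75 * len(df))
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(44898)

# Sanity checks: the sizes add up and no row lands in both parts
assert len(train_idx) + len(test_idx) == 44898
assert set(train_idx).isdisjoint(test_idx)
print(len(train_idx), len(test_idx))  # 33673 11225
```

The disjointness check is the useful part: if a row ever appears in both lists, test accuracy no longer measures generalisation.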


In [16]:
subdir_path = os.path.join(dataset_path, "News _dataset")  # Path to subdirectory
print("Files in subdirectory:", os.listdir(subdir_path))

Files in subdirectory: ['True.csv', 'Fake.csv']


In [17]:
true_df = pd.read_csv(os.path.join(subdir_path, 'True.csv'))
true_df

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"
...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017"
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017"
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017"
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017"


In [18]:
fake_df = pd.read_csv(os.path.join(subdir_path, 'Fake.csv'))
fake_df

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016"
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016"
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016"
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016"


In [19]:
fake_df["target"] = 1  # 1 = Fake News
true_df["target"] = 0  # 0 = Real News
df = pd.concat([fake_df, true_df], ignore_index=True)
df = df.sample(frac=1, random_state=1).reset_index(drop=True) # Shuffle the dataset
df

Unnamed: 0,title,text,subject,date,target
0,Trump Calls For This Racist Policy To Be Forc...,Donald Trump is calling for one of the most co...,News,"September 21, 2016",1
1,Republican ex-defense secretary Cohen backs Hi...,WASHINGTON (Reuters) - Former Republican U.S. ...,politicsNews,"September 7, 2016",0
2,"TEACHER QUITS JOB After 5th, 6th Grade Muslim ...",You re never to young to commit jihad Teachers...,politics,"May 9, 2017",1
3,LAURA INGRAHAM RIPS INTO THE PRESS…Crowd Goes ...,Laura Ingraham reminds the Never Trump people ...,politics,"Jul 21, 2016",1
4,Germany's Merkel suffers state vote setback as...,BERLIN/HANOVER (Reuters) - Germany s Social De...,worldnews,"October 14, 2017",0
...,...,...,...,...,...
44893,Guatemala federal auditor to probe president's...,GUATEMALA CITY (Reuters) - Guatemala s federal...,worldnews,"September 13, 2017",0
44894,House Democrats will stage sit-in until they g...,WASHINGTON (Reuters) - U.S. House of Represent...,politicsNews,"June 22, 2016",0
44895,D’oh!: Trump Tells Crowd In Richest County In...,"While in Virginia, GOP presidential nominee Do...",News,"August 3, 2016",1
44896,JUDGE JEANINE TELLS THE LEFT TO KNOCK IT OFF: ...,Judge Jeanine Pirro has had it with the left a...,politics,"Dec 11, 2016",1


In [20]:
train_size = int(0.75 * len(df))

train_df = df.iloc[:train_size]
test_df = df.iloc[train_size:]

print("Training set size:", len(train_df))
print("Test set size:", len(test_df))

Training set size: 33673
Test set size: 11225


In [21]:
train_df

Unnamed: 0,title,text,subject,date,target
0,Trump Calls For This Racist Policy To Be Forc...,Donald Trump is calling for one of the most co...,News,"September 21, 2016",1
1,Republican ex-defense secretary Cohen backs Hi...,WASHINGTON (Reuters) - Former Republican U.S. ...,politicsNews,"September 7, 2016",0
2,"TEACHER QUITS JOB After 5th, 6th Grade Muslim ...",You re never to young to commit jihad Teachers...,politics,"May 9, 2017",1
3,LAURA INGRAHAM RIPS INTO THE PRESS…Crowd Goes ...,Laura Ingraham reminds the Never Trump people ...,politics,"Jul 21, 2016",1
4,Germany's Merkel suffers state vote setback as...,BERLIN/HANOVER (Reuters) - Germany s Social De...,worldnews,"October 14, 2017",0
...,...,...,...,...,...
33668,Schumer says U.S. budget deal doable if Trump ...,WASHINGTON (Reuters) - Senate Democratic Leade...,politicsNews,"April 23, 2017",0
33669,WOMAN PULLED OVER FOR 51 MPH IN SCHOOL ZONE: “...,I ve never been more grateful there are so man...,left-news,"Sep 11, 2015",1
33670,"INDIAN-AMERICAN, Inventor Of Email Announces R...",Boston-based entrepreneur and inventor of Emai...,politics,"Feb 25, 2017",1
33671,Trump aides divided over policy shielding 'dre...,WASHINGTON (Reuters) - Divisions have emerged ...,politicsNews,"January 28, 2017",0


# 🚀 Start Your Submission Here!  

Now it’s your turn to shine! ✨  

📌 **Instructions:**  
- Implement your **feature engineering** and **model training** below.  
- Remember to use only the **75% training set** for training your model.  
- Test your model on the **25% test set** and analyze its performance.  

💡 **Pro Tip:** Try experimenting with different NLP techniques (TF-IDF, Word Embeddings, Transformers) to boost your model’s accuracy!  

Best of luck! 🎯 Let’s see who builds the most accurate Fake News Classifier! 🏆  
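
To make the TF-IDF tip concrete, here is a minimal sketch of the weighting itself, written with the standard library only. It assumes plain whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is the default scheme documented for scikit-learn's `TfidfVectorizer`; the real vectorizer tokenizes differently, so treat the numbers as illustrative. `tfidf_vectors` and the three example headlines are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times smoothed inverse document frequency
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = [
    "senate passes budget bill",
    "senate delays budget vote",
    "shocking secret the senate hides",
]
vectors = tfidf_vectors(docs)
first = vectors[0]

# "senate" occurs in every document, "passes" in only one, so "passes" weighs more
print(first["passes"] > first["senate"])  # True
```

Terms that appear everywhere carry little signal, which is why the weighting pushes them down relative to rarer words.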


In [22]:
# Sample submission

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

In [23]:
train_df = train_df.copy()
test_df = test_df.copy()

train_df["content"] = train_df["title"] + " " + train_df["subject"] + " " + train_df["text"] 
test_df["content"] = test_df["title"] + " " + test_df["subject"] + " " + test_df["text"] 

print("Training set size:", len(train_df))
print("Test set size:", len(test_df))

vectorizer = TfidfVectorizer(max_features=100000)

X_train = vectorizer.fit_transform(train_df["content"])
X_test = vectorizer.transform(test_df["content"])

y_train = train_df["target"]
y_test = test_df["target"]

model = LogisticRegression(max_iter=100000)
model.fit(X_train, y_train)

Training set size: 33673
Test set size: 11225


# 🏆 Test Your Accuracy Here!  

Now that you've trained your model, let's see how well it performs! 🚀  

📌 **Instructions:**  
- Use the **test set (25%)** to make predictions.  
- Compare your predictions against the actual labels.  
- Calculate the **accuracy** score to evaluate performance.  

📊 **Accuracy Calculation:**  
$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
$$

🚀 **Submit your final accuracy score below and see how you rank!** 🔥  
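
As a worked instance of the formula above, the following sketch computes accuracy by hand on a small set of made-up labels. The `accuracy` helper and both label lists are hypothetical; on real predictions, `sklearn.metrics.accuracy_score` gives the same ratio.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 1 = fake, 0 = real
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 6 of 8 correct -> 0.75
```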

In [24]:

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy:", accuracy)


Test Accuracy: 0.98913140311804


### Zero-Shot Classification Models

In [25]:
from transformers import pipeline

# Initialize a zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", 
                      model="facebook/bart-large-mnli")

# Example text to classify
sequence_to_classify = "This article claims that the Earth is flat."

# Candidate labels (possible categories for classification)
candidate_labels = ["fake news", "real news", "misleading", "conspiracy"]

# Run inference
result = classifier(sequence_to_classify, candidate_labels)
print(result)

Device set to use cpu


{'sequence': 'This article claims that the Earth is flat.', 'labels': ['misleading', 'conspiracy', 'fake news', 'real news'], 'scores': [0.6637750864028931, 0.14953134953975677, 0.12464383244514465, 0.062049709260463715]}


In [26]:
# Initialize zero-shot classification pipeline with a lightweight model
classifier = pipeline("zero-shot-classification", 
                      model="typeform/distilbert-base-uncased-mnli")

# Example text to classify
sequence_to_classify = "This article asserts that climate change is a hoax."

# Candidate labels for classification
candidate_labels = ["fake news", "real news", "misleading", "conspiracy"]

# Run inference
result = classifier(sequence_to_classify, candidate_labels)
print(result)

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Device set to use cpu


{'sequence': 'This article asserts that climate change is a hoax.', 'labels': ['fake news', 'misleading', 'conspiracy', 'real news'], 'scores': [0.6349366307258606, 0.36300691962242126, 0.0016185108106583357, 0.0004378522571641952]}


In [27]:
# Initialize zero-shot classification pipeline with a distilled BART model
classifier = pipeline("zero-shot-classification", 
                      model="valhalla/distilbart-mnli-12-3")

# Example text to classify
sequence_to_classify = "This report claims that a cure for the common cold has been discovered."

# Candidate labels for classification
candidate_labels = ["fake news", "real news", "misleading", "conspiracy"]

# Run inference
result = classifier(sequence_to_classify, candidate_labels)
print(result)

Device set to use cpu


{'sequence': 'This report claims that a cure for the common cold has been discovered.', 'labels': ['misleading', 'fake news', 'real news', 'conspiracy'], 'scores': [0.6190851330757141, 0.17369163036346436, 0.12602326273918152, 0.08119994401931763]}
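
The zero-shot pipelines above return a dict of labels and scores rather than the competition's 0/1 target. One possible way to bridge the two is sketched below. The mapping is an assumption made for this example (every label except "real news" is treated as fake), and `LABEL_TO_TARGET` and `zero_shot_to_target` are hypothetical names, not part of the `transformers` API.

```python
# Assumption for this sketch: every label except "real news" counts as fake (1)
LABEL_TO_TARGET = {
    "fake news": 1,
    "misleading": 1,
    "conspiracy": 1,
    "real news": 0,
}

def zero_shot_to_target(result):
    """Map a zero-shot result dict to 1 (fake) or 0 (real) via its top label."""
    # Labels and scores are parallel lists; pick the highest-scoring label
    top_label, _ = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return LABEL_TO_TARGET[top_label]

# Result copied from the first zero-shot cell above, with scores rounded
example = {
    "sequence": "This article claims that the Earth is flat.",
    "labels": ["misleading", "conspiracy", "fake news", "real news"],
    "scores": [0.6638, 0.1495, 0.1246, 0.0620],
}
print(zero_shot_to_target(example))  # top label is "misleading" -> 1
```

A different label set or mapping could change the outcome noticeably, so the choice is worth validating against the labelled training set before relying on it.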


In [30]:
# Inspect the title column before training on titles only
train_df["title"]

0         Trump Calls For This Racist Policy To Be Forc...
1        Republican ex-defense secretary Cohen backs Hi...
2        TEACHER QUITS JOB After 5th, 6th Grade Muslim ...
3        LAURA INGRAHAM RIPS INTO THE PRESS…Crowd Goes ...
4        Germany's Merkel suffers state vote setback as...
                               ...                        
33668    Schumer says U.S. budget deal doable if Trump ...
33669    WOMAN PULLED OVER FOR 51 MPH IN SCHOOL ZONE: “...
33670    INDIAN-AMERICAN, Inventor Of Email Announces R...
33671    Trump aides divided over policy shielding 'dre...
33672    WHY UNEDUCATED SOMALI REFUGEES Who Don’t Speak...
Name: title, Length: 33673, dtype: object

## Create separate data sets

In [36]:
train_title_df = train_df["title"]
train_text_df = train_df["text"]
test_title_df = test_df["title"]
test_text_df = test_df["text"]

In [37]:
test_title_df

33673    Trump reelection campaign raised $10 million i...
33674    TV Reporter FIRED After Being Caught On Video ...
33675     BREAKING: Dakota Access Pipeline STOPPED By A...
33676    Trump says he did not record conversations wit...
33677    Top Russian and U.S. generals discuss Syria bo...
                               ...                        
44893    Guatemala federal auditor to probe president's...
44894    House Democrats will stage sit-in until they g...
44895     D’oh!: Trump Tells Crowd In Richest County In...
44896    JUDGE JEANINE TELLS THE LEFT TO KNOCK IT OFF: ...
44897    Divided lawmakers battle over Puerto Rico debt...
Name: title, Length: 11225, dtype: object

In [None]:
y_train = train_df["target"]
y_test = test_df["target"]

In [38]:
vectorizer = TfidfVectorizer(max_features=100000)

train_title_df = vectorizer.fit_transform(train_title_df)
test_title_df = vectorizer.transform(test_title_df)

model = LogisticRegression(max_iter=100000)
model.fit(train_title_df, y_train)

predictions = model.predict(test_title_df)
accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy:", accuracy)

Test Accuracy: 0.9510022271714922


In [44]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Requires the NLTK 'stopwords' and 'wordnet' data (see the nltk.download cells).
# Build these once; calling stopwords.words('english') for every token is very slow.
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove everything except letters and whitespace
    text = re.sub(r'[^a-z\s]', '', text)
    # Split on whitespace, drop stopwords, and lemmatize
    words = [lemmatizer.lemmatize(word) for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)

# Apply preprocessing to the raw text columns. Read from train_df/test_df directly:
# train_text_df had already been overwritten with a TF-IDF matrix by an earlier cell,
# which made the original call fail with
# AttributeError: 'csr_matrix' object has no attribute 'apply'
train_text_clean = train_df["text"].apply(preprocess_text)
test_text_clean = test_df["text"].apply(preprocess_text)

# Now apply TfidfVectorizer to the preprocessed text
vectorizer = TfidfVectorizer(max_features=100000)
train_text_df = vectorizer.fit_transform(train_text_clean)
test_text_df = vectorizer.transform(test_text_clean)

In [None]:
model = LogisticRegression(max_iter=100000)
model.fit(train_text_df, y_train)

predictions = model.predict(test_text_df)
accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy:", accuracy)

In [39]:
vectorizer = TfidfVectorizer(max_features=100000)

train_text_df = vectorizer.fit_transform(train_text_df)
test_text_df = vectorizer.transform(test_text_df)

model = LogisticRegression(max_iter=100000)
model.fit(train_text_df, y_train)

predictions = model.predict(test_text_df)
accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy:", accuracy)

Test Accuracy: 0.9856570155902005


In [40]:
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# Initialize CatBoostClassifier
catboost_model = CatBoostClassifier(iterations=1000, depth=10, learning_rate=0.05, loss_function='Logloss', verbose=200)

# Train the model
catboost_model.fit(train_title_df, y_train)

# Make predictions
catboost_predictions = catboost_model.predict(test_title_df)

# Calculate accuracy
catboost_accuracy = accuracy_score(y_test, catboost_predictions)
print("CatBoost Test Accuracy:", catboost_accuracy)

0:	learn: 0.6496842	total: 2.3s	remaining: 38m 21s


KeyboardInterrupt: 

In [41]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Train the model
rf_model.fit(train_title_df, y_train)

# Make predictions
rf_predictions = rf_model.predict(test_title_df)

# Calculate accuracy
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Random Forest Test Accuracy:", rf_accuracy)

Random Forest Test Accuracy: 0.8824944320712695


In [42]:
from xgboost import XGBClassifier

# Initialize XGBoost Classifier
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.05, max_depth=10, random_state=42)

# Train the model
xgb_model.fit(train_title_df, y_train)

# Make predictions
xgb_predictions = xgb_model.predict(test_title_df)

# Calculate accuracy
xgb_accuracy = accuracy_score(y_test, xgb_predictions)
print("XGBoost Test Accuracy:", xgb_accuracy)

XGBoost Test Accuracy: 0.9123385300668151


In [46]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------


Downloader>  d



Download which package (l=list; x=cancel)?


  Identifier>  l


Packages:
  [ ] averaged_perceptron_tagger_eng Averaged Perceptron Tagger (JSON)
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] averaged_perceptron_tagger_rus Averaged Perceptron Tagger (Russian)
  [ ] bcp47............... BCP-47 Language Tags
  [ ] comparative_sentences Comparative Sentence Dataset
  [ ] dolch............... Dolch Word List
  [ ] english_wordnet..... Open English Wordnet
  [ ] extended_omw........ Extended Open Multilingual WordNet
  [ ] framenet_v15........ FrameNet 1.5
  [ ] framenet_v17........ FrameNet 1.7
  [-] inaugural........... C-Span Inaugural Address Corpus
  [ ] maxent_ne_chunker_tab ACE Named Entity Chunker (Maximum entropy)
  [ ] maxent_treebank_pos_tagger_tab Treebank Part of Speech Tagger (Maximum entropy)
  [ ] mwa_ppdb............ The monolingual word aligner (Sultan et al.
                           2015) subset of the Paraphrase Database.
  [ ] nombank.1.0......... NomBank Corpus 1.0
  [ ] nonbreaking_prefixes Non-Br

Hit Enter to continue:  


  [ ] perluniprops........ perluniprops: Index of Unicode Version 7.0.0
                           character properties in Perl
  [ ] punkt_tab........... Punkt Tokenizer Models
  [ ] tagsets_json........ Help on Tagsets (JSON)
  [ ] verbnet3............ VerbNet Lexicon, Version 3.3
  [ ] wmt15_eval.......... Evaluation data from WMT15
  [ ] wordnet2021......... Open English Wordnet 2021
  [ ] wordnet2022......... Open English Wordnet 2022
  [ ] wordnet31........... Wordnet 3.1

Collections:
  [-] all-corpora......... All the corpora
  [-] all-nltk............ All packages available on nltk_data gh-pages
                           branch
  [-] all................. All packages
  [-] book................ Everything used in the NLTK Book
  [-] popular............. Popular packages
  [P] tests............... Packages for running tests
  [ ] third-party......... Third-party data packages

([*] marks installed packages; [-] marks out-of-date or corrupt packages;
 [P] marks partially install

  Identifier>  all-nltk


    Downloading collection 'all-nltk'
       | 
       | Downloading package abc to /usr/share/nltk_data...
       |   Package abc is already up-to-date!
       | Downloading package alpino to /usr/share/nltk_data...
       |   Package alpino is already up-to-date!
       | Downloading package averaged_perceptron_tagger to
       |     /usr/share/nltk_data...
       |   Package averaged_perceptron_tagger is already up-to-date!
       | Downloading package averaged_perceptron_tagger_eng to
       |     /usr/share/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
       | Downloading package averaged_perceptron_tagger_ru to
       |     /usr/share/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger_ru.zip.
       | Downloading package averaged_perceptron_tagger_rus to
       |     /usr/share/nltk_data...
       |   Unzipping taggers/averaged_perceptron_tagger_rus.zip.
       | Downloading package basque_grammars to
       |     /usr/share/nltk

KeyboardInterrupt: Interrupted by user

In [48]:
import nltk

# Download the necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [50]:
import nltk

# Try downloading the necessary NLTK resources
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [54]:
import spacy
import re

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Preprocessing Function using spaCy
def preprocess_text_spacy(text):
    # Lowercase the text
    text = text.lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Process the text with spaCy
    doc = nlp(text)
    
    # Remove stopwords and apply lemmatization
    words = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    
    return ' '.join(words)

# Apply preprocessing to the title and text columns
train_title_df = train_df["title"].apply(preprocess_text_spacy)
train_text_df = train_df["text"].apply(preprocess_text_spacy)
test_title_df = test_df["title"].apply(preprocess_text_spacy)
test_text_df = test_df["text"].apply(preprocess_text_spacy)

In [None]:
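# Use unigrams and bigrams in TF-IDF (ngram_range=(1, 2))
# Separate vectorizers keep the title and text vocabularies independent,
# so each fitted vectorizer stays usable after this cell
title_vectorizer = TfidfVectorizer(max_features=100000, ngram_range=(1, 2))
text_vectorizer = TfidfVectorizer(max_features=100000, ngram_range=(1, 2))

# Fit on the training data only, then transform the test data
train_title_df_tfidf = title_vectorizer.fit_transform(train_title_df)
test_title_df_tfidf = title_vectorizer.transform(test_title_df)

train_text_df_tfidf = text_vectorizer.fit_transform(train_text_df)
test_text_df_tfidf = text_vectorizer.transform(test_text_df)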
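
For readers unsure what `ngram_range=(1, 2)` adds, the sketch below lists the features it would produce for one short headline: every single word plus every pair of adjacent words. It uses plain whitespace splitting, which is a simplification of the vectorizer's own tokenizer, and `word_ngrams` is a hypothetical helper written for this example.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    tokens = text.lower().split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams("Senate passes budget bill"))
# ['senate', 'passes', 'budget', 'bill', 'senate passes', 'passes budget', 'budget bill']
```

Bigrams let a linear model pick up short phrases, at the cost of a much larger vocabulary, which is one reason `max_features` is capped.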

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model using title TF-IDF features
rf_model.fit(train_title_df_tfidf, y_train)

# Make predictions
rf_predictions = rf_model.predict(test_title_df_tfidf)

# Calculate accuracy
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Random Forest Test Accuracy:", rf_accuracy)

In [None]:
from xgboost import XGBClassifier

# Initialize XGBoost
xgb_model = XGBClassifier(n_estimators=100, max_depth=10, learning_rate=0.05)

# Train the model using title TF-IDF features
xgb_model.fit(train_title_df_tfidf, y_train)

# Make predictions
xgb_predictions = xgb_model.predict(test_title_df_tfidf)

# Calculate accuracy
xgb_accuracy = accuracy_score(y_test, xgb_predictions)
print("XGBoost Test Accuracy:", xgb_accuracy)

In [None]:
from catboost import CatBoostClassifier

# Initialize CatBoost
catboost_model = CatBoostClassifier(iterations=1000, depth=10, learning_rate=0.05, loss_function='Logloss', verbose=200)

# Train the model using title TF-IDF features
catboost_model.fit(train_title_df_tfidf, y_train)

# Make predictions
catboost_predictions = catboost_model.predict(test_title_df_tfidf)

# Calculate accuracy
catboost_accuracy = accuracy_score(y_test, catboost_predictions)
print("CatBoost Test Accuracy:", catboost_accuracy)