
In [76]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

In [77]:
train_bodies = pd.read_csv('train_bodies_processed.csv')
train_stances = pd.read_csv('train_stances_processed.csv')

In [78]:
train_bodies.head()

Unnamed: 0,Body ID,articleBody
0,0,A small meteorite crashed into a wooded area i...
1,4,Last week we hinted at what was to come as Ebo...
2,5,(NEWSER) – Wonder how long a Quarter Pounder w...
3,6,"Posting photos of a gun-toting child online, I..."
4,7,At least 25 suspected Boko Haram insurgents we...


In [79]:
train_stances.head()

Unnamed: 0,Headline,Body ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree


Check Missing Values

In [80]:
train_bodies.isnull().sum()

Unnamed: 0,0
Body ID,0
articleBody,0


In [81]:
train_stances.isnull().sum()

Unnamed: 0,0
Headline,0
Body ID,0
Stance,0
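Both frames report zero missing values, so no imputation is needed at this point. As a minimal sketch of how the same check could be made to fail loudly if a later version of the data did contain gaps, the helper below raises an error when any column has nulls. The name `assert_no_missing` and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def assert_no_missing(df, name):
    """Raise ValueError if any column of df contains missing values."""
    missing = df.isnull().sum()
    missing = missing[missing > 0]  # keep only the columns that have gaps
    if not missing.empty:
        raise ValueError(f"{name} has missing values: {missing.to_dict()}")

# Toy frame standing in for the real CSVs; one article body is missing on purpose
toy = pd.DataFrame({"Body ID": [0, 4], "articleBody": ["some text", None]})

try:
    assert_no_missing(toy, "toy")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

On the real data this would be called as `assert_no_missing(train_bodies, "train_bodies")` and would return silently.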


Check and Remove Duplicates

In [82]:
train_stances_duplicates = train_stances.duplicated(subset=['Headline', 'Body ID', 'Stance'], keep='first').sum()
print(f"Total duplicate rows: {train_stances_duplicates}")


Total duplicate rows: 402


In [83]:
train_bodies_duplicates = train_bodies.duplicated(subset=['Body ID', 'articleBody'], keep='first').sum()
print(f"Total duplicate rows: {train_bodies_duplicates}")


Total duplicate rows: 0


In [84]:
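# .copy() returns an independent frame, so later column assignments do not trigger SettingWithCopyWarning
train_stances_cleaned = train_stances.drop_duplicates(subset=['Headline', 'Body ID', 'Stance'], keep='first').copy()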


In [85]:
train_bodies_cleaned = train_bodies.drop_duplicates(subset=['Body ID', 'articleBody'], keep='first').copy()
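To illustrate how `duplicated` and `drop_duplicates` relate, here is a small sketch on a toy frame (the data is invented for illustration). With the same `subset` and `keep='first'`, the number of rows flagged as duplicates should equal the number of rows dropped.

```python
import pandas as pd

# Toy stances: the second row repeats the first on all three key columns
toy_stances = pd.DataFrame({
    "Headline": ["A", "A", "A", "B"],
    "Body ID": [1, 1, 2, 1],
    "Stance": ["agree", "agree", "agree", "discuss"],
})
key_cols = ["Headline", "Body ID", "Stance"]

# Count rows that repeat an earlier row on the key columns
n_duplicates = int(toy_stances.duplicated(subset=key_cols, keep="first").sum())

# Drop them, keeping the first occurrence; .copy() yields an independent frame
deduped = toy_stances.drop_duplicates(subset=key_cols, keep="first").copy()

# The rows removed should match the rows flagged
rows_removed = len(toy_stances) - len(deduped)
print(n_duplicates, rows_removed)
```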

In [86]:
def preprocess_text(text):
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces and trim both ends
    text = ' '.join(text.split())
    return text
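A quick usage sketch of the whitespace normalization, using an invented sample string. The function body is repeated so the example runs on its own. Note that it assumes a string input; a missing value (NaN) would raise an `AttributeError`, which is why the null check earlier matters.

```python
def preprocess_text(text):
    # Same logic as the notebook cell: split on any whitespace, rejoin with single spaces
    return ' '.join(text.split())

sample = "  Soldier shot\tnear   Canadian\nparliament building  "
cleaned = preprocess_text(sample)
print(repr(cleaned))  # 'Soldier shot near Canadian parliament building'
```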

In [87]:
train_bodies_cleaned['articleBody'] = train_bodies_cleaned['articleBody'].apply(preprocess_text)

In [88]:
train_stances_cleaned['Headline'] = train_stances_cleaned['Headline'].apply(preprocess_text)



In [89]:
# Encode the stance labels as integers (LabelEncoder assigns codes in sorted label order)
label_encoder = LabelEncoder()
train_stances_cleaned['Stance'] = label_encoder.fit_transform(train_stances_cleaned['Stance'])



In [90]:
stance_classes = label_encoder.classes_
print("Stance Classes:", stance_classes)

Stance Classes: ['agree' 'disagree' 'discuss' 'unrelated']
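Because `LabelEncoder` sorts the labels before assigning codes, the position of each label in `classes_` is its integer code. The sketch below, on a small invented list of stances, builds that mapping explicitly and shows how `inverse_transform` recovers the original strings. The variable names `label_to_code` and `decoded` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

stances = ["unrelated", "agree", "discuss", "disagree", "agree"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(stances)

# classes_ is sorted, so each label's index is its integer code
label_to_code = {label: code for code, label in enumerate(label_encoder.classes_)}
print(label_to_code)  # {'agree': 0, 'disagree': 1, 'discuss': 2, 'unrelated': 3}

# Map the integer codes back to the original string labels
decoded = list(label_encoder.inverse_transform(codes))
print(decoded)
```

Keeping this mapping alongside the saved data may help when reading the numeric `Stance` column later.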


In [91]:
# Optional (disabled): drop the 'discuss' (2) and 'unrelated' (3) stances, keeping only 'agree' and 'disagree'
#train_stances_filtered = train_stances_cleaned[(train_stances_cleaned['Stance'] != 2) & (train_stances_cleaned['Stance'] != 3)]
#train_stances_filtered

In [92]:
#merged_df = pd.merge(train_bodies, train_stances_filtered, on='Body ID', how='inner')
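# Merge the cleaned bodies (not the raw frame) so the whitespace normalization is carried into the output
merged_df = pd.merge(train_bodies_cleaned, train_stances_cleaned, on='Body ID', how='inner')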
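An inner merge silently drops any stance whose `Body ID` has no matching article, and any article with no stances. As a hedged sketch on invented toy frames, `pd.merge` can report such rows through `indicator=True` and can enforce the expected one-body-to-many-headlines relationship through `validate`.

```python
import pandas as pd

bodies = pd.DataFrame({"Body ID": [0, 4], "articleBody": ["text a", "text b"]})
stances = pd.DataFrame({
    "Headline": ["h1", "h2", "h3"],
    "Body ID": [0, 0, 9],   # Body ID 9 has no article in `bodies`
    "Stance": [3, 0, 1],
})

# The outer merge keeps unmatched rows; the _merge column says which side each came from.
# validate="one_to_many" raises MergeError if a Body ID repeats in `bodies`.
checked = pd.merge(bodies, stances, on="Body ID", how="outer",
                   validate="one_to_many", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Body ID", "_merge"]])

# The inner merge used in the notebook keeps only the matched rows
inner = pd.merge(bodies, stances, on="Body ID", how="inner", validate="one_to_many")
print(len(inner))
```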

In [93]:
new_column_order = ['Body ID', 'Headline', 'articleBody', 'Stance']

merged_df = merged_df[new_column_order]

In [94]:
merged_df = merged_df.sort_values('Body ID', ascending=True)
display(merged_df.head())

Unnamed: 0,Body ID,Headline,articleBody,Stance
0,0,"Soldier shot, Parliament locked down after gun...",A small meteorite crashed into a wooded area i...,3
20,0,Soldier shot near Canadian parliament building,A small meteorite crashed into a wooded area i...,3
21,0,Soldier shot in Ottawa at War Memorial,A small meteorite crashed into a wooded area i...,3
22,0,Soldier shot at War Memorial; multiple shots f...,A small meteorite crashed into a wooded area i...,3
23,0,Comcast Is Threatening To Cut Off Customers Wh...,A small meteorite crashed into a wooded area i...,3


In [95]:
merged_df.to_csv('preprocessed_merged_data.csv', index=False, encoding="utf-8-sig")
print("\nPreprocessed and merged data saved to 'preprocessed_merged_data.csv'")


Preprocessed and merged data saved to 'preprocessed_merged_data.csv'


In [96]:
tfidf = TfidfVectorizer()
merged_df["combined_text"] = merged_df["Headline"] + " " + merged_df["articleBody"]
bow_rep_tfidf = tfidf.fit_transform(merged_df["combined_text"])

# Note: toarray() materializes the full dense matrix, which can exhaust memory on a large corpus
display(bow_rep_tfidf.toarray())

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.11368925, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.12016244, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.11156256, 0.        , ..., 0.        , 0.        ,
        0.        ]])