<a href="https://colab.research.google.com/github/AhmedCoolProjects/ESI/blob/main/Text_Mining_Project_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BY AHMED BARGADY

# Data Preprocessing Notebook 🧹

In this notebook, we prepare the collected data for analysis: we drop missing and duplicate rows, clean the text, and tokenize it. A clean, consistent dataset is essential for deriving meaningful insights.

## Table of Contents
1. [Data Loading](#data-loading)
2. [Data Cleaning](#data-cleaning)
   - 2.1 [Handling Missing Values](#handling-missing-values)
   - 2.2 [Removing Duplicates](#removing-duplicates)
   - 2.3 [Text Cleaning](#text-cleaning)
3. [Tokenizing Data](#tokenizing-data)
4. [Save Preprocessed Data](#save-preprocessed-data)

Let's dive in!


# Data Loading

In [26]:
import pandas as pd

csv_file_path = 'https://firebasestorage.googleapis.com/v0/b/esi-school-resources.appspot.com/o/text_mining%2Fproject%2Fahmed_bargady_collected_data.csv?alt=media&token=cb47884a-aff3-42bb-9e94-cc325cfa2f1a'

# Load data from CSV into a DataFrame
df = pd.read_csv(csv_file_path)

# Display the first few rows of the DataFrame to inspect the loaded data
df.head()

Unnamed: 0,content
0,Congress could do much more to protect America...
1,Christina Iverson and Jeff Chen ring in the Ne...
2,"All year long, Earth passes through streams of..."
3,"Never miss an eclipse, a meteor shower, a rock..."
4,A year full of highs and lows in space just en...


In [27]:
df.shape

(99587, 1)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99587 entries, 0 to 99586
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  98268 non-null  object
dtypes: object(1)
memory usage: 778.1+ KB


**Only 98268 of the 99587 rows are non-null, so some rows have no content**

In [29]:
df.describe()

Unnamed: 0,content
count,98268
unique,37947
top,"Get breaking national and world news, broadcas..."
freq,408


**Only 37947 of the 98268 non-null rows are unique, so many rows are duplicated**

# Data Cleaning

### Handling Missing Values

In [30]:
# Check for missing values in the DataFrame
missing_values = df.isnull().sum()

# Display the count of missing values for each column
print("Missing Values:\n", missing_values)

Missing Values:
 content    1319
dtype: int64


**So 1319 rows have no content**
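
Note that `isnull()` only counts true missing values. As a minimal sketch on toy data (the frame below is hypothetical, not taken from the collected dataset), rows holding an empty or whitespace-only string are not reported as null and would need a separate check if they matter for the analysis:

```python
import pandas as pd

# Toy data (hypothetical): one real text, one missing value, two blank strings
toy = pd.DataFrame({"content": ["Some article text", None, "   ", ""]})

# True missing values, as counted by isnull()/isna()
missing = toy["content"].isna().sum()

# Rows that are missing OR contain nothing but whitespace
blank = toy["content"].fillna("").str.strip().eq("").sum()

print(missing)  # 1
print(blank)    # 3
```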

In [31]:
# Drop rows with missing values
df = df.dropna()

# Display the first few rows of the cleaned DataFrame
df.head()

Unnamed: 0,content
0,Congress could do much more to protect America...
1,Christina Iverson and Jeff Chen ring in the Ne...
2,"All year long, Earth passes through streams of..."
3,"Never miss an eclipse, a meteor shower, a rock..."
4,A year full of highs and lows in space just en...


In [32]:
df.shape

(98268, 1)

In [33]:
df.isnull().sum()

content    0
dtype: int64

**Great! No null rows remain**

### Removing Duplicates

In [34]:
# Check for duplicate rows in the DataFrame
duplicates = df.duplicated()

# Display the count of duplicate rows
print("Duplicate Rows:\n", duplicates.sum())

Duplicate Rows:
 60321


**So we have 60321 duplicated rows!**
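
This count agrees with the earlier `df.describe()` output: 98268 non-null rows minus 37947 unique values gives 60321. The relationship can be checked on a small example (the toy series below is hypothetical and only illustrates the arithmetic):

```python
import pandas as pd

# Toy column (hypothetical): "a" appears three times
toy = pd.Series(["a", "b", "a", "c", "a"], name="content")

n_rows = toy.count()              # non-null rows
n_unique = toy.nunique()          # distinct values
n_dupes = toy.duplicated().sum()  # repeats after the first occurrence

print(n_rows, n_unique, n_dupes)  # 5 3 2
```

With no missing values in the column, `count - unique` always equals the number of rows that `drop_duplicates()` removes.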

In [35]:
# Drop duplicate rows, keeping the first occurrence
df = df.drop_duplicates()

# Display the first few rows of the DataFrame without duplicates
df.head()

Unnamed: 0,content
0,Congress could do much more to protect America...
1,Christina Iverson and Jeff Chen ring in the Ne...
2,"All year long, Earth passes through streams of..."
3,"Never miss an eclipse, a meteor shower, a rock..."
4,A year full of highs and lows in space just en...


In [36]:
df.shape

(37947, 1)

In [37]:
df.duplicated().sum()

0

**Great! No duplicates remain**

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37947 entries, 0 to 98086
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  37947 non-null  object
dtypes: object(1)
memory usage: 592.9+ KB
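
The index above runs from 0 to 98086 but holds only 37947 entries, because `dropna()` and `drop_duplicates()` keep the original row labels. This is harmless, but label-based and position-based lookups no longer agree. A minimal sketch on toy data (hypothetical frame) of how the index could be made contiguous again:

```python
import pandas as pd

# Toy data (hypothetical): one missing row and one duplicate
toy = pd.DataFrame({"content": ["a", None, "a", "b"]})
toy = toy.dropna().drop_duplicates()
print(toy.index.tolist())  # [0, 3]  (labels 1 and 2 are gone)

# Renumber the rows 0..n-1 and discard the old labels
toy = toy.reset_index(drop=True)
print(toy.index.tolist())  # [0, 1]
```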


In [39]:
df.describe()

Unnamed: 0,content
count,37947
unique,37947
top,Congress could do much more to protect America...
freq,1


**Amazing! The most frequent element now appears only once, so every row is unique ✅**

### Text Cleaning

In [47]:
# Regular expressions for stripping punctuation and digits
import re

# NLTK provides the stemmer and the English stopword list
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Download the stopword list (skipped if already present)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [48]:
stemmer = PorterStemmer()

In [50]:
# Function for text cleaning
def clean_text(text):
    # Remove punctuation and numbers
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    # Apply stemming
    words = [stemmer.stem(word) for word in words]
    # Join the words to form the processed text
    text = ' '.join(words)
    return text

# Apply the cleaning function to the 'content' column
df['cleaned_content'] = df['content'].apply(clean_text)

# Display the first few rows of the DataFrame with cleaned text
df.head()

Unnamed: 0,content,cleaned_content
0,Congress could do much more to protect America...,congress could much protect american serv coun...
1,Christina Iverson and Jeff Chen ring in the Ne...,christina iverson jeff chen ring new year
2,"All year long, Earth passes through streams of...",year long earth pass stream cosmic debri here ...
3,"Never miss an eclipse, a meteor shower, a rock...",never miss eclips meteor shower rocket launch ...
4,A year full of highs and lows in space just en...,year full high low space end month come full n...
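
One side effect of the order of operations in `clean_text` is worth knowing: punctuation is stripped before the stopword check, so a contraction such as "don't" becomes "dont" and may no longer match the stopword list. The sketch below illustrates this with a tiny stand-in stopword set (hypothetical; the notebook uses NLTK's English list) and compares it with an alternative order. Both helper functions are illustrative only and are not part of the notebook's pipeline.

```python
import re

# Tiny stand-in stopword set (hypothetical, for illustration only)
STOP_WORDS = {"the", "to", "a", "don't", "it's"}

def strip_then_filter(text):
    """Same order as clean_text: remove punctuation first, then filter stopwords."""
    text = re.sub(r"[^\w\s]", "", text).lower()
    return [w for w in text.split() if w not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: filter stopwords first, then remove leftover punctuation."""
    # Trim punctuation at token edges only, so inner apostrophes survive the check
    tokens = [w.strip(".,!?;:\"()") for w in text.lower().split()]
    kept = [w for w in tokens if w and w not in STOP_WORDS]
    cleaned = (re.sub(r"[^\w\s]", "", w) for w in kept)
    return [w for w in cleaned if w]

sample = "Don't miss the launch, it's a big one!"
print(strip_then_filter(sample))  # ['dont', 'miss', 'launch', 'its', 'big', 'one']
print(filter_then_strip(sample))  # ['miss', 'launch', 'big', 'one']
```

Whether the leftover tokens matter depends on the downstream model; with a 25000-word vocabulary they are unlikely to dominate, but they do add noise.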


# Tokenizing Data

In [58]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [63]:
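MAX_LENGTH = 100       # length every sequence is padded or truncated to
VOCAB_SIZE = 25000     # passed to the tokenizer as num_words
SEQUENCE_LENGTH = 50   # not used in this notebook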

In [64]:
# Tokenizing the text
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(df['cleaned_content'])

In [65]:
# converting text to sequences
sequences = tokenizer.texts_to_sequences(df['cleaned_content'])
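
To make the effect of `num_words` concrete, here is a plain-Python imitation of the indexing behaviour as we understand it: words are ranked by frequency starting at index 1 (0 is reserved for padding), and words whose index is not below `num_words` are dropped when converting to sequences. The two helper functions are hypothetical stand-ins, not Keras code.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, silently dropping those with index >= num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

texts = ["space launch", "space eclips", "space launch meteor"]
word_index = fit_word_index(texts)
print(word_index)  # {'space': 1, 'launch': 2, 'eclips': 3, 'meteor': 4}
print(to_sequences(texts, word_index, num_words=3))  # [[1, 2], [1], [1, 2]]
```

With `VOCAB_SIZE = 25000`, this means at most the 24999 most frequent stems appear in the sequences, and rarer ones are omitted rather than replaced by a placeholder.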

In [66]:
padded_sequences = pad_sequences(sequences, maxlen=MAX_LENGTH, padding='post', truncating='post')

In [67]:
# Display the first sequence as an example
first_text = df['cleaned_content'].iloc[0]  # positional lookup; the index has gaps after dropping rows
print("Original text:\n", first_text, " len: ", len(first_text.split()))
print("\nTokenized sequence:\n", sequences[0], " len: ", len(sequences[0]))
print("\nPadded sequence:\n", padded_sequences[0], " len: ", len(padded_sequences[0]))

Original text:
 congress could much protect american serv countri predatori forprofit colleg  len:  10

Tokenized sequence:
 [720, 23, 177, 272, 46, 736, 25, 6624, 13872, 355]  len:  10

Padded sequence:
 [  720    23   177   272    46   736    25  6624 13872   355     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]  len:  100
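
The output shows the 10 tokens followed by 90 zeros. The rule applied by `padding='post', truncating='post'` can be sketched in plain Python as follows (the `pad_post` helper is a hypothetical illustration, not the Keras implementation):

```python
def pad_post(seq, maxlen, value=0):
    """Cut off anything beyond maxlen, then fill the end with `value`."""
    trimmed = seq[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([720, 23, 177], maxlen=5)
long_ = pad_post([1, 2, 3, 4, 5, 6], maxlen=5)

print(short)  # [720, 23, 177, 0, 0]
print(long_)  # [1, 2, 3, 4, 5]
```

Texts longer than `MAX_LENGTH` tokens lose their tail, while shorter ones are mostly padding, so the choice of `MAX_LENGTH` is a trade-off between lost content and wasted space.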


# Save Preprocessed Data

In [68]:
import numpy as np

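# Save the padded sequences as a regular 2-D array
np.save('padded_sequences.npy', padded_sequences)

# The raw sequences have different lengths, so store them as an object array
# (reload with np.load('sequences.npy', allow_pickle=True))
np.save('sequences.npy', np.array(sequences, dtype=object))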


In [69]:
# Save DataFrame to a CSV file
df.to_csv('cleaned_df.csv', index=False)
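
One thing to watch when reading the CSV back: if a row's cleaned text is empty (for example, a text made up only of stopwords, digits, and punctuation), `read_csv` returns it as `NaN` rather than an empty string. A minimal sketch of the round trip, using a toy frame and an in-memory buffer in place of `cleaned_df.csv` (both are assumptions for illustration):

```python
import io
import pandas as pd

# Toy frame (hypothetical): the first row cleans down to an empty string
toy = pd.DataFrame({
    "content": ["The 2024!", "Space launch"],
    "cleaned_content": ["", "space launch"],
})

buffer = io.StringIO()  # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
n_missing = reloaded["cleaned_content"].isna().sum()
print(n_missing)  # 1  (the empty string came back as NaN)

# Restore empty strings so later string operations see str values only
reloaded["cleaned_content"] = reloaded["cleaned_content"].fillna("")
print(reloaded["cleaned_content"].tolist())  # ['', 'space launch']
```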