Hi, here is my code to clean the dataset from Kaggle. Kindly check my guidance and comments to understand how I cleaned and processed the data.

Please download the original dataset from my GitHub repository (Data > queried data). You will need to upload it to your Google Drive before processing it:

https://github.com/Rising-Stars-by-Sunshine/STATS201-PS2-Jenny/blob/55fa0bcebfa3144c801e6ec003967747ec3ee92e/Data/queried%20data/Climate%20change_2022-1-17_2022-7-19.csv

## Install and Import
First, we import the packages we need and mount Google Drive so that the notebook can access the data.

In [28]:
from google.colab import drive
drive.mount('/content/STATS201', force_remount=True)
from google.colab import files
import pandas as pd
import io

Mounted at /content/STATS201


## Load the Dataset
We use pandas to load and display the dataset that we got from Kaggle.

In [5]:
with open('/content/STATS201/MyDrive/Climate change_2022-1-17_2022-7-19.csv', 'r') as f:
  df = pd.read_csv(f)

In [6]:
df.shape

(9050, 11)

In [11]:
df.tail()

Unnamed: 0,UserScreenName,UserName,Timestamp,Text,Embedded_text,Emojis,Comments,Likes,Retweets,Image link,Tweet URL
9045,Dr Srijana Mitra Das,@srijanapiya17,2022-07-18T12:08:28.000Z,Dr Srijana Mitra Das\n@srijanapiya17\n·\nJul 18,#ClimateChange is now the greatest story on Ea...,,2.0,16.0,24.0,['https://pbs.twimg.com/profile_images/5140754...,https://twitter.com/srijanapiya17/status/15490...
9046,1%_Better_Every_Day,@jh336405,2022-07-18T00:33:20.000Z,1%_Better_Every_Day\n@jh336405\n·\nJul 18,Replying to \n@jh336405\n @acuna_r\n and 41 ot...,💯 💯 🌏,4.0,,,['https://pbs.twimg.com/profile_images/1442412...,https://twitter.com/jh336405/status/1548828230...
9047,David Schechter,@DavidSchechter,2022-07-18T21:13:13.000Z,David Schechter\n@DavidSchechter\n·\nJul 18,While Texans are being asked to use less elect...,,3.0,14.0,23.0,['https://pbs.twimg.com/card_img/1549138950475...,https://twitter.com/DavidSchechter/status/1549...
9048,Daily Climate,@TheDailyClimate,2022-07-18T10:15:09.000Z,Daily Climate\n@TheDailyClimate\n·\nJul 18,"Sea levels are rising, and communities are scr...",,,3.0,,['https://pbs.twimg.com/card_img/1547862999808...,https://twitter.com/TheDailyClimate/status/154...
9049,Robot Enthusiast,@robothusiast,2022-07-18T16:32:53.000Z,Robot Enthusiast\n@robothusiast\n·\nJul 18,UTS creates algae-analysing robot to combat cl...,,,,,['https://pbs.twimg.com/card_img/1549069752219...,https://twitter.com/robothusiast/status/154906...


Now that we have a basic understanding of what the dataset looks like, the next step is to clean it to fit our needs.
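One way to see why cleaning is needed is to count duplicate tweets and missing values before changing anything. The sketch below does this on a small made-up frame; the `toy` DataFrame and its `example.com` URLs are assumptions for illustration, and only the column names come from the dataset above.

```python
import pandas as pd

# Hypothetical stand-in for df, reusing three of the real column names
toy = pd.DataFrame({
    "Tweet URL": [
        "https://example.com/status/1",
        "https://example.com/status/1",  # same URL as the row above
        "https://example.com/status/2",
    ],
    "Embedded_text": ["first tweet", "first tweet", "second tweet"],
    "Likes": [2.0, 2.0, None],
})

# Rows whose 'Tweet URL' already appeared in an earlier row
n_dupes = toy.duplicated(subset=["Tweet URL"]).sum()

# Number of missing values in each column
missing = toy.isna().sum()

print(n_dupes)
print(missing)
```

Running the same two calls on `df` should show how many rows the duplicate removal will drop and which columns contain gaps.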

Here is a general guide to help you understand how I clean the data:


**Data Cleaning and Processing Guide:**

**1. Duplicate Removal:**
   - **Objective:** Ensure each tweet is unique.
   - **Step:** Identify and remove duplicate tweets based on their unique 'Tweet URL.'

**2. Tweet Text Cleaning:**
   - **Objective:** Enhance text quality and readability.
   - **Steps:**
      - **Full Cleaning:**
         - Remove unnecessary elements like URLs, emojis, mentions, smileys, and numbers from the entire tweet.
         - Create a new column 'cleaned_text' containing fully cleaned tweets.
      - **Partial Cleaning:**
         - Remove URLs and mentions, creating a 'partial_clean' column.

**3. Additional Cleaning:**
   - **Objective:** Standardize and improve text consistency.
   - **Steps:**
      - Remove '#' symbols from the 'cleaned_text' column.
      - Convert all text to lowercase.
      - Replace punctuation (any character that is not a letter, digit, underscore, or whitespace) with a space, then collapse extra whitespace.
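To make the guide concrete, here is a rough sketch of steps 2 and 3 that uses only Python's built-in `re` module. It is not the tweet-preprocessor library used in this notebook, and its patterns are simplified assumptions, so its output may differ from the library's on real tweets. The function names `partial_clean_sketch` and `full_clean_sketch` are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # simplified URL pattern (assumption)
MENTION_RE = re.compile(r"@\w+")               # @username mentions
NUMBER_RE = re.compile(r"\b\d+\b")             # standalone integers only
SPACE_RE = re.compile(r"\s+")                  # any run of whitespace


def partial_clean_sketch(text):
    """Remove URLs and mentions, then tidy up the whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()


def full_clean_sketch(text):
    """Partial cleaning plus numbers, emojis, '#', case, and punctuation."""
    text = partial_clean_sketch(text)
    text = NUMBER_RE.sub("", text)
    # Dropping every non-ASCII character removes emojis (and curly quotes too)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.replace("#", "").lower()
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation becomes a space
    return SPACE_RE.sub(" ", text).strip()


sample = "Replying to @user1 #ClimateChange is REAL!! https://example.com/x 2022 🌏"
print(partial_clean_sketch(sample))
print(full_clean_sketch(sample))
```

Either function could be applied to a column with `df['Embedded_text'].apply(...)`, in the same way the library-based functions are applied later in this notebook.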

Before we write the data cleaning code, we install tweet-preprocessor, the package that provides the `preprocessor` module we import next.

In [14]:
!pip install tweet-preprocessor

Collecting tweet-preprocessor
  Downloading tweet_preprocessor-0.6.0-py3-none-any.whl (27 kB)
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0


In [19]:
import preprocessor as p

In [23]:
# Step 1: Drop duplicate rows based on the 'Tweet URL' column
df.drop_duplicates(subset=['Tweet URL'], keep='first', inplace=True)

# Step 2: Define a function to clean the tweet using preprocessor
def clean_tweet(tweet):
    p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.MENTION, p.OPT.SMILEY, p.OPT.NUMBER)
    # Use preprocessor to clean the tweet
    cleaned_tweet = p.clean(tweet)
    return cleaned_tweet

# Step 3: Define a function to partially clean the tweet (remove URL and mention)
def partial_clean(tweet):
    p.set_options(p.OPT.URL, p.OPT.MENTION)
    # Use preprocessor to clean the tweet
    cleaned_tweet = p.clean(tweet)
    return cleaned_tweet

# Step 4: Apply the partial_clean function to create a new 'partial_clean' column
df['partial_clean'] = df['Embedded_text'].apply(partial_clean)

# Step 5: Apply the clean_tweet function to create a new 'cleaned_text' column
df['cleaned_text'] = df['Embedded_text'].apply(clean_tweet)

# Step 6: Further cleaning - Remove '#' symbols, convert to lowercase, and clean extra whitespaces
df['cleaned_text'] = df['cleaned_text'].str.replace(r'#', '', regex=True)
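# regex=True is needed: since pandas 2.0, str.replace treats the pattern as a literal string by default
df['cleaned_text'] = (
    df['cleaned_text']
    .str.lower()
    .str.replace(r'[^\w\s]', ' ', regex=True)   # punctuation -> space
    .str.replace(r'\s\s+', ' ', regex=True)     # collapse runs of whitespace
)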

# Step 7: Display the cleaned dataframe
display(df)


Unnamed: 0,UserScreenName,UserName,Timestamp,Text,Embedded_text,Emojis,Comments,Likes,Retweets,Image link,Tweet URL,mentioned_users,partial_clean,cleaned_text
0,Lauren Boebert,@laurenboebert,2022-01-17T23:32:38.000Z,Lauren Boebert\n@laurenboebert\n·\nJan 18,The only solution I’ve ever heard the Left pro...,,1683,2259,11.7K,[],https://twitter.com/laurenboebert/status/14832...,,The only solution I’ve ever heard the Left pro...,the only solution ive ever heard the left prop...
1,Catherine,@catherine___c,2022-01-17T22:54:02.000Z,Catherine\n@catherine___c\n·\nJan 17,Climate change doesn’t cause volcanic eruption...,,158,64,762,[],https://twitter.com/catherine___c/status/14832...,,Climate change doesn’t cause volcanic eruption...,climate change doesnt cause volcanic eruptions
2,king Keith,@KaConfessor,2022-01-17T23:51:41.000Z,king Keith\n@KaConfessor\n·\nJan 18,Vaccinated tennis ball boy collapses in the te...,,24,118,159,['https://pbs.twimg.com/ext_tw_video_thumb/148...,https://twitter.com/KaConfessor/status/1483225...,,Vaccinated tennis ball boy collapses in the te...,vaccinated tennis ball boy collapses in the te...
3,PETRIFIED CLIMATE PARENT,@climate_parent,2022-01-17T21:42:04.000Z,PETRIFIED CLIMATE PARENT\n@climate_parent\n·\n...,North America has experienced an average winte...,,15,50,158,[],https://twitter.com/climate_parent/status/1483...,,North America has experienced an average winte...,north america has experienced an average winte...
4,Thomas Speight,@Thomas_Sp8,2022-01-17T21:10:40.000Z,Thomas Speight\n@Thomas_Sp8\n·\nJan 17,They're gonna do the same with Climate Change ...,🅾,4,24,127,['https://pbs.twimg.com/profile_images/1544171...,https://twitter.com/Thomas_Sp8/status/14831850...,joeywreck,They're gonna do the same with Climate Change ...,they re gonna do the same with climate change ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9045,Dr Srijana Mitra Das,@srijanapiya17,2022-07-18T12:08:28.000Z,Dr Srijana Mitra Das\n@srijanapiya17\n·\nJul 18,#ClimateChange is now the greatest story on Ea...,,2,16,24,['https://pbs.twimg.com/profile_images/5140754...,https://twitter.com/srijanapiya17/status/15490...,"DrSimEvans,DrSimEvans",#ClimateChange is now the greatest story on Ea...,climatechange is now the greatest story on ear...
9046,1%_Better_Every_Day,@jh336405,2022-07-18T00:33:20.000Z,1%_Better_Every_Day\n@jh336405\n·\nJul 18,Replying to \n@jh336405\n @acuna_r\n and 41 ot...,💯 💯 🌏,4,,,['https://pbs.twimg.com/profile_images/1442412...,https://twitter.com/jh336405/status/1548828230...,"jh336405,acuna_r,rahmstorf",Replying to and 41 others And Stefan Rahmstorf...,replying to and others and stefan rahmstorf co...
9047,David Schechter,@DavidSchechter,2022-07-18T21:13:13.000Z,David Schechter\n@DavidSchechter\n·\nJul 18,While Texans are being asked to use less elect...,,3,14,23,['https://pbs.twimg.com/card_img/1549138950475...,https://twitter.com/DavidSchechter/status/1549...,GregAbbott_TX,While Texans are being asked to use less elect...,while texans are being asked to use less elect...
9048,Daily Climate,@TheDailyClimate,2022-07-18T10:15:09.000Z,Daily Climate\n@TheDailyClimate\n·\nJul 18,"Sea levels are rising, and communities are scr...",,,3,,['https://pbs.twimg.com/card_img/1547862999808...,https://twitter.com/TheDailyClimate/status/154...,NHBulletinNews,"Sea levels are rising, and communities are scr...",sea levels are rising and communities are scra...
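A note on the `str.replace` calls in Step 6: whether the pattern is read as a regular expression depends on the `regex` argument, and its default changed from `True` to `False` in pandas 2.0. The short sketch below passes the argument explicitly both ways to show the difference. The sample string is made up.

```python
import pandas as pd

s = pd.Series(["hello, world!!  #climate"])

# regex=False: the pattern is searched for as literal text, so nothing matches
literal = s.str.replace(r"[^\w\s]", " ", regex=False)

# regex=True: every punctuation character is replaced by a space
as_regex = s.str.replace(r"[^\w\s]", " ", regex=True)

# Collapse the runs of spaces left behind
collapsed = as_regex.str.replace(r"\s\s+", " ", regex=True)

print(literal.iloc[0])
print(collapsed.iloc[0])
```

Passing `regex=True` explicitly keeps the cleaning step behaving the same way across pandas versions.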


Now we have the cleaned dataset!

We only need one more step to save it. The cleaned dataset is written as cleaned_dataset.csv to the Drive folder /content/STATS201/MyDrive.

In [27]:
import os

output_folder = '/content/STATS201/MyDrive'
output_file_path = os.path.join(output_folder, 'cleaned_dataset.csv')

df.to_csv(output_file_path, index=False)
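As an optional sanity check, the saved file can be read back and compared with the frame in memory. The sketch below shows the idea on a small made-up frame written to a temporary folder. The `toy` data and the temporary path are assumptions; in the notebook you would reload `output_file_path` instead.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned df
toy = pd.DataFrame({
    "Tweet URL": ["https://example.com/status/1", "https://example.com/status/2"],
    "cleaned_text": ["climate change is real", "sea levels are rising"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_dataset.csv")
    toy.to_csv(path, index=False)   # index=False: do not write the row index
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```

Because the file is written with `index=False`, reading it back does not produce an extra `Unnamed: 0` column.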