# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# URL for IMDb reviews
url = "https://www.imdb.com/title/tt0111161/reviews"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

def scrape_imdb_reviews(num_reviews=1000):
    data = []
    page = 1

    while len(data) < num_reviews:
        try:
            response = requests.get(url, headers=HEADERS)
            if response.status_code != 200:
                print(f"Failed to retrieve page {page}. Status Code: {response.status_code}")
                break

            soup = BeautifulSoup(response.text, 'html.parser')
            reviews = soup.find_all('div', class_='sc-8c7aa573-5 gBEznl')
            users = soup.find_all('div', {'data-testid': 'reviews-author'})
            ratings = soup.find_all('span', class_='ipc-rating-star--rating')


            if not reviews:
                print("No more reviews found.")
                break

            for review, user_block, rating_tag in zip(reviews, users, ratings):
                try:

                    # Extracting user ID
                    user_tag = user_block.find("a", {"data-testid": "author-link"})
                    user_id = user_tag.text.strip() if user_tag else "No User ID"

                    # Extracting user profile link
                    user_profile_link = "https://www.imdb.com" + user_tag["href"] if user_tag else "No Profile Link"
                    # Extracting review title
                    title_tag = review.find("h3", class_="ipc-title__text")
                    title = title_tag.text.strip() if title_tag else "No title"

                    # Extracting review text
                    text_tag = review.find("div", class_="ipc-html-content-inner-div")
                    text = text_tag.text.strip() if text_tag else "No detailed text"


                    if text == "No detailed text":
                        text_alt_tag = review.find_next_sibling("div", {"data-testid": "review-overflow"})
                        text = text_alt_tag.text.strip() if text_alt_tag else "No text"

                    # Extracting Rating
                    rating = rating_tag.text.strip() if rating_tag else "No Rating"

                    # data
                    data.append({
                          "User-Id": user_id,
                          "Title": title,
                          "Comment": text,
                          "Rating": rating,
                          "Page": page  # Store the page number
                    })


                    if len(data) >= num_reviews:
                        break

                except Exception as e:
                    print(f"Skipping review due to error: {e}")

            page += 1

        except Exception as e:
            print(f"Error scraping page {page}: {e}")

        time.sleep(2)  # Adding  delay

    return pd.DataFrame(data)

# Scrapping 1000 reviews
df = scrape_imdb_reviews(1000)

df.to_csv("imdb_reviews.csv", index=False)

# Displaying the first few rows
df.head()

Unnamed: 0,User-Id,Title,Comment,Rating,Page
0,hitchcockthelegend,Some birds aren't meant to be caged.,No text,10,1
1,Sleepin_Dragon,An incredible movie. One that lives with you.,It is no wonder that the film has such a high ...,10,1
2,EyeDunno,Don't Rent Shawshank.,I'm trying to save you money; this is the last...,10,1
3,kaspen12,A classic piece of unforgettable film-making.,No text,10,1
4,alexkolokotronis,This is How Movies Should Be Made,This movie is not your ordinary Hollywood flic...,10,1


# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [27]:
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

#1. Function to remove noise
def remove_noise(text):
    if pd.isna(text):
        return ""
    return re.sub(r'[^a-zA-Z\s]', '', text)

#2. Function to remove numbers
def remove_numbers(text):
    if pd.isna(text):
        return ""
    return re.sub(r'\d+', '', text)

#3. Function to remove stopwords
def remove_stopwords(text):
    if pd.isna(text):  # Handle NaN values
        return ""
    stop = set(stopwords.words('english'))
    words = text.split()
    return " ".join([word for word in words if word.lower() not in stop or word.lower() == "no"])

#5. Function to apply stemming
def apply_stemming(text):
    stemmer = PorterStemmer()
    words = text.split()
    return " ".join([stemmer.stem(word) for word in words])

#6. Function to apply lemmatization
def apply_lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    words = text.split()
    return " ".join([lemmatizer.lemmatize(word) for word in words])

# Applying text cleaning steps
df['clean_title'] = df['Title'].str.lower().apply(remove_noise).apply(remove_numbers).apply(remove_stopwords)
df['clean_comment'] = df['Comment'].str.lower().apply(remove_noise).apply(remove_numbers).apply(remove_stopwords)

# Applying stemming and lemmatization
df['stemmed_title'] = df['clean_title'].apply(apply_stemming)
df['stemmed_comment'] = df['clean_comment'].apply(apply_stemming)
df['lemmatized_title'] = df['clean_title'].apply(apply_lemmatization)
df['lemmatized_comment'] = df['clean_comment'].apply(apply_lemmatization)

# Saving cleaned data to CSV
df.to_csv("cleaned_imdb_reviews.csv", index=False)

# cleaned dataset
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,User-Id,Title,Comment,Rating,Page,clean_title,clean_comment,stemmed_title,stemmed_comment,lemmatized_title,lemmatized_comment
0,hitchcockthelegend,Some birds aren't meant to be caged.,No text,10,1,birds arent meant caged,no text,bird arent meant cage,no text,bird arent meant caged,no text
1,Sleepin_Dragon,An incredible movie. One that lives with you.,It is no wonder that the film has such a high ...,10,1,incredible movie one lives,no wonder film high rating quite literally bre...,incred movi one live,no wonder film high rate quit liter breathtak ...,incredible movie one life,no wonder film high rating quite literally bre...
2,EyeDunno,Don't Rent Shawshank.,I'm trying to save you money; this is the last...,10,1,dont rent shawshank,im trying save money last film title consider ...,dont rent shawshank,im tri save money last film titl consid borrow...,dont rent shawshank,im trying save money last film title consider ...
3,kaspen12,A classic piece of unforgettable film-making.,No text,10,1,classic piece unforgettable filmmaking,no text,classic piec unforgett filmmak,no text,classic piece unforgettable filmmaking,no text
4,alexkolokotronis,This is How Movies Should Be Made,This movie is not your ordinary Hollywood flic...,10,1,movies made,movie ordinary hollywood flick great deep mess...,movi made,movi ordinari hollywood flick great deep messa...,movie made,movie ordinary hollywood flick great deep mess...


# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [53]:
#1.
import pandas as pd
import spacy
from collections import Counter

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Load the cleaned IMDb Reviews dataset
df_imdb = pd.read_csv("/content/cleaned_imdb_reviews.csv")

# Function for POS tagging and counting specific tags
def pos_analysis(text):
    doc = nlp(text)

    # Count specific POS categories
    pos_counts = Counter(token.pos_ for token in doc)
    noun_count = pos_counts["NOUN"]  # Nouns
    verb_count = pos_counts["VERB"]  # Verbs
    adj_count = pos_counts["ADJ"]    # Adjectives
    adv_count = pos_counts["ADV"]    # Adverbs

    return [(token.text, token.pos_) for token in doc], noun_count, verb_count, adj_count, adv_count

# Apply POS tagging on a subset of IMDb data
df_imdb_sample = df_imdb[['clean_comment']].dropna().head(10)  # Using first 10 comments for analysis

imdb_pos_results = []
for comment in df_imdb_sample['clean_comment']:
    pos_tags, noun_count, verb_count, adj_count, adv_count = pos_analysis(comment)
    imdb_pos_results.append((pos_tags, noun_count, verb_count, adj_count, adv_count))

df_imdb_pos = pd.DataFrame(imdb_pos_results, columns=['POS Tags', 'Nouns', 'Verbs', 'Adjectives', 'Adverbs'])

# Display results
df_imdb_pos.head()




Unnamed: 0,POS Tags,Nouns,Verbs,Adjectives,Adverbs
0,"[(no, DET), (text, NOUN)]",1,0,0,0
1,"[(no, DET), (wonder, NOUN), (film, NOUN), (hig...",20,12,8,4
2,"[(i, PRON), (m, AUX), (trying, VERB), (save, V...",92,65,32,21
3,"[(no, DET), (text, NOUN)]",1,0,0,0
4,"[(movie, NOUN), (ordinary, ADJ), (hollywood, N...",52,37,16,10


In [60]:
import pandas as pd
import spacy
from collections import Counter
from spacy import displacy

# Loading spaCy's English language model
nlp = spacy.load("en_core_web_sm")

df_imdb = pd.read_csv("cleaned_imdb_reviews.csv")

# Function for Dependency Parsing
def dependency_parsing(sentence):
    doc = nlp(sentence)
    return [(token.text, token.dep_, token.head.text) for token in doc]  # Token, dependency relation, head word

# Function for Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts = Counter(ent.label_ for ent in doc.ents)
    return entities, entity_counts

# Applying parsing and NER on a subset of IMDb data
df_imdb_sample = df_imdb[['clean_comment']].dropna().head(5)  # Using first 5 comments for analysis

dependency_trees = []
ner_results = []

for comment in df_imdb_sample['clean_comment']:
    dependency_tree = dependency_parsing(comment)
    dependency_trees.append(dependency_tree)

    named_entities, entity_counts = named_entity_recognition(comment)
    ner_results.append((named_entities, entity_counts))

# Converting results into DataFrames
df_ner = pd.DataFrame(ner_results, columns=['Named Entities', 'Entity Counts'])

# results
df_ner.head()




Unnamed: 0,Named Entities,Entity Counts
0,[],{}
1,[],{}
2,"[(five bucks, MONEY), (today, DATE), (seven, C...","{'MONEY': 1, 'DATE': 1, 'CARDINAL': 2, 'PERSON..."
3,[],{}
4,"[(tim, PERSON), (one, CARDINAL), (andy dufresn...","{'PERSON': 4, 'CARDINAL': 2, 'ORG': 1, 'ORDINA..."


# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [33]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

# GitHub Marketplace URL
BASE_URL = "https://github.com/marketplace?type=actions&page="
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://github.com/marketplace",
}

def scrape_github_marketplace(pages=50):
    data = []
    session = requests.Session()

    for page in range(1, pages + 1):
        url = BASE_URL + str(page)

        try:
            response = session.get(url, headers=HEADERS)
            if response.status_code != 200:
                print(f"Failed to retrieve page {page}. Status Code: {response.status_code}")
                continue

            soup = BeautifulSoup(response.text, 'html.parser')
            actions = soup.find_all('div', class_='position-relative border rounded-2 d-flex marketplace-common-module__marketplace-item--MohVH gap-3 p-3')

            for action in actions:
                try:
                    name_tag = action.find('h3', class_='d-flex f4 lh-condensed prc-Heading-Heading-6CmGO')
                    desc_tag = action.find('p', class_='mt-1 mb-0 text-small fgColor-muted line-clamp-2')
                    link_tag = action.find('a', href=True)

                    name = name_tag.text.strip() if name_tag else "Unknown"
                    description = desc_tag.text.strip() if desc_tag else "No description"
                    link = "https://github.com" + link_tag['href'].strip() if link_tag else "No link"

                    data.append([name, description, link, page])
                except Exception as e:
                    print(f"Skipping item due to error: {e}")

        except Exception as e:
            print(f"Error scraping page {page}: {e}")

        time.sleep(5)

    return pd.DataFrame(data, columns=['Product Name', 'Description', 'URL', 'Page'])

# Scrapping data
df = scrape_github_marketplace(50)

# Saving raw data
df.to_csv("github_marketplace_raw.csv", index=False)

df.head()


Unnamed: 0,Product Name,Description,URL,Page
0,TruffleHog OSS,Scan Github Actions with TruffleHog,https://github.com/marketplace/actions/truffle...,1
1,Metrics embed,An infographics generator with 40+ plugins and...,https://github.com/marketplace/actions/metrics...,1
2,yq - portable yaml processor,"create, read, update, delete, merge, validate ...",https://github.com/marketplace/actions/yq-port...,1
3,Super-Linter,Super-linter is a ready-to-run collection of l...,https://github.com/marketplace/actions/super-l...,1
4,Gosec Security Checker,Runs the gosec security checker,https://github.com/marketplace/actions/gosec-s...,1


In [42]:
import pandas as pd
import nltk
from textblob import TextBlob
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
stop = set(stopwords.words('english'))


# Function for tokenization and stopword removal using TextBlob
def clean_and_tokenize_text(text):
    if pd.isna(text):
        return []
    words = list(TextBlob(text.lower()).words)
    words = [word for word in words if word not in stop]
    return words

# Applying text cleaning, tokenization, and stopword removal
df['tokenized_product_name'] = df['Product Name'].apply(clean_and_tokenize_text)
df['tokenized_description'] = df['Description'].apply(clean_and_tokenize_text)

# Saving cleaned and tokenized data
df.to_csv("github_marketplace_cleaned_tokenized.csv", index=False)

df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Product Name,Description,URL,Page,tokenized_product_name,tokenized_description
0,TruffleHog OSS,Scan Github Actions with TruffleHog,https://github.com/marketplace/actions/truffle...,1,"[trufflehog, oss]","[scan, github, actions, trufflehog]"
1,Metrics embed,An infographics generator with 40+ plugins and...,https://github.com/marketplace/actions/metrics...,1,"[metrics, embed]","[infographics, generator, 40, plugins, 300, op..."
2,yq - portable yaml processor,"create, read, update, delete, merge, validate ...",https://github.com/marketplace/actions/yq-port...,1,"[yq, portable, yaml, processor]","[create, read, update, delete, merge, validate..."
3,Super-Linter,Super-linter is a ready-to-run collection of l...,https://github.com/marketplace/actions/super-l...,1,[super-linter],"[super-linter, ready-to-run, collection, linte..."
4,Gosec Security Checker,Runs the gosec security checker,https://github.com/marketplace/actions/gosec-s...,1,"[gosec, security, checker]","[runs, gosec, security, checker]"


#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [None]:
!pip install tweepy pandas nltk




In [None]:
import tweepy
import pandas as pd
import time
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# NLTK data
nltk.download("stopwords")
nltk.download("punkt")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import tweepy

# Keys and tokens
API_KEY = "A9XM4CUUxiv1EREVmrHI8d6p2"
API_SECRET = "SwFHbe7KCxVq4LnPNmbPB5gdC5gNoL8wrlT5xohtfjfilPL02V"
ACCESS_TOKEN = "1891368043550523392-B73A6c2pcq4eOyWu5WnpJNX6sR1QfE"
ACCESS_SECRET = "zAxBnd1H2cUo1c3kMWWwipRNkgxV5XsC4SMwZgzCTJfAM"
BEARER_TOKEN = "AAAAAAAAAAAAAAAAAAAAAAFSzQEAAAAA2R3F0aAZwUeffItrLy425q6z3fA%3DdY8Qw6IInFXo7QLYvVhukPKkQDX8eFjY0GrWhe9H6NYGaPD6J8"


# Authentication using OAuth 1.0a
auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

client = tweepy.Client(bearer_token=BEARER_TOKEN)

print("✅ Authentication Successful!")


✅ Authentication Successful!


In [None]:
import tweepy
import pandas as pd
import time
import os

# Initializing Tweepy Client with Authentication
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Function to scrape tweets
def scrape_tweets(query, num_tweets=100):
    tweets_data = []

    try:
        for tweet in tweepy.Paginator(
            client.search_recent_tweets,
            query=query,
            tweet_fields=["id", "text", "created_at", "author_id"],
            max_results=50
        ).flatten(limit=num_tweets):

            tweets_data.append({
                "Tweet_ID": tweet.id,
                "Author_ID": tweet.author_id,
                "Text": tweet.text,
                "Created_At": tweet.created_at
            })

    except tweepy.TooManyRequests:
        print("🔴 API Rate Limit Reached! Waiting 15 minutes before retrying...")
        time.sleep(900)
        return scrape_tweets(query, num_tweets)
    except tweepy.TweepyException as e:
        print(f"🔴 An error occurred: {e}")
        return pd.DataFrame()

    return pd.DataFrame(tweets_data)

# Scraping tweets related to AI & Machine Learning
df_tweets = scrape_tweets("#MachineLearning OR #ArtificialIntelligence -is:retweet", num_tweets=50)

#  Saving tweets to a CSV file
df_tweets.to_csv("AI_ML_Tweets.csv", index=False)

print("✅ Successfully collected & saved 50 tweets in 'AI_ML_Tweets.csv'!")


🔴 API Rate Limit Reached! Waiting 15 minutes before retrying...
✅ Successfully collected & saved 50 tweets in 'AI_ML_Tweets.csv'!


In [None]:
# Loading Raw Tweets
df = pd.read_csv("AI_ML_Tweets.csv")
df.head()

Unnamed: 0,Tweet_ID,Author_ID,Text,Created_At
0,1891375023689998679,2164266428,Mind blown! Scientists have created an AI that...,2025-02-17 06:32:00+00:00
1,1891375000436777303,1486657334566961159,RT @Parajulisaroj16: Unlike univariate or biva...,2025-02-17 06:31:54+00:00
2,1891374976042733896,1858225741244338176,AI is NOT the future—it’s the NOW!\nFrom smart...,2025-02-17 06:31:48+00:00
3,1891374858426073321,1755274625615851520,🚀Day 59 of #300DaysOfMachineLearning\nPOS Tagg...,2025-02-17 06:31:20+00:00
4,1891374719024140641,121378193,Perplexity’s Deep Research tool is now FREE! U...,2025-02-17 06:30:47+00:00


In [None]:
import pandas as pd
import re

# Loading the raw tweets dataset
df = pd.read_csv("AI_ML_Tweets.csv")

# Step 1: Removing Retweets
df = df[~df["Text"].str.startswith("RT @")]  # Remove rows where text starts with "RT @"

# Step 2: Handling Missing Values
df.dropna(inplace=True)  # Drop rows with missing values

# Step 3: Removing Duplicates
df.drop_duplicates(subset=["Tweet_ID"], inplace=True)  # Drop duplicate tweets based on Tweet_ID

# Step 4: Cleaning Text
def clean_text(text):
    # Removing URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    # Removing mentions (@username)
    text = re.sub(r"@\w+", "", text)
    # Removing hashtags (#)
    text = re.sub(r"#\w+", "", text)
    # Removing newlines and extra spaces
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["Text"] = df["Text"].apply(clean_text)

# Step 5: Converting `Created_At` to DateTime
df["Created_At"] = pd.to_datetime(df["Created_At"])

# Step 6: Data Quality Check
print("\nCleaned Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nDuplicate Rows:")
print(df.duplicated().sum())

# Saving the cleaned data to a new CSV file
df.to_csv("Cleaned_AI_ML_Tweets.csv", index=False)
print("\n✅ Successfully cleaned and saved the data to 'Cleaned_AI_ML_Tweets.csv'!")


Cleaned Data:
              Tweet_ID            Author_ID  \
0  1891375023689998679           2164266428   
2  1891374976042733896  1858225741244338176   
3  1891374858426073321  1755274625615851520   
4  1891374719024140641            121378193   
5  1891374673159426245           3221243756   

                                                Text                Created_At  
0  Mind blown! Scientists have created an AI that... 2025-02-17 06:32:00+00:00  
2  AI is NOT the future—it’s the NOW! From smart ... 2025-02-17 06:31:48+00:00  
3  🚀Day 59 of POS Tagging in NLP: - POS Tagging –... 2025-02-17 06:31:20+00:00  
4  Perplexity’s Deep Research tool is now FREE! U... 2025-02-17 06:30:47+00:00  
5  Understanding SAS Data Engineering: A Complete... 2025-02-17 06:30:36+00:00  

Data Info:
<class 'pandas.core.frame.DataFrame'>
Index: 28 entries, 0 to 45
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 

In [None]:
from google.colab import files
files.download("Cleaned_Twitter_Tweets.csv")


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog