# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software products from G2 and Capterra.


In [None]:
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen


narrators_data = []


main_url = "https://ddr.densho.org/narrators/?page={}"


counter = 0
max_narrators = 904


for page_num in range(1, 46):
    link1 = Request(main_url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
    url1 = urlopen(link1)
    data1 = url1.read()
    data1_soup = BeautifulSoup(data1, 'html.parser')

    print(f"*** Processing Page {page_num} ***")  # Print page number for tracking

    # Iterate through each narrator on the page
    for narrator_link in data1_soup.find_all('h4'):
        if counter >= max_narrators:
            break

        link2 = Request(narrator_link.a.get('href'), headers={'User-Agent': 'Mozilla/5.0'})
        url2 = urlopen(link2)
        data2 = url2.read()
        data2_soup = BeautifulSoup(data2, 'html.parser')

        # Extract narrator name and biography
        try:
            narrator = data2_soup.find("div", attrs={'class': 'col-sm-8 col-md-8'})
            name = narrator.h1.text.strip().replace('"', "") if narrator.h1 else ""
            bio = narrator.p.text.strip() if narrator.p else ""
        except AttributeError:
            # find() returned None: the expected container is missing on this page
            name, bio = "", ""

        # Store extracted data
        narrators_data.append({'Narrator_Name': name, 'Bio': bio})
        counter += 1  # Increase counter

    if counter >= max_narrators:
        break  # Stop if we reach the desired number


df = pd.DataFrame(narrators_data)

# Save DataFrame to CSV
df.to_csv('narrators_data.csv', index=False, encoding='utf-8')

print(f"✅ Total narrators processed: {len(narrators_data)}")
print("📁 Data saved as 'narrators_data.csv'")


*** Processing Page 1 ***
*** Processing Page 2 ***
*** Processing Page 3 ***
*** Processing Page 4 ***
*** Processing Page 5 ***
*** Processing Page 6 ***
*** Processing Page 7 ***
*** Processing Page 8 ***
*** Processing Page 9 ***
*** Processing Page 10 ***
*** Processing Page 11 ***
*** Processing Page 12 ***
*** Processing Page 13 ***
*** Processing Page 14 ***
*** Processing Page 15 ***
*** Processing Page 16 ***
*** Processing Page 17 ***
*** Processing Page 18 ***
*** Processing Page 19 ***
*** Processing Page 20 ***
*** Processing Page 21 ***
*** Processing Page 22 ***
*** Processing Page 23 ***
*** Processing Page 24 ***
*** Processing Page 25 ***
*** Processing Page 26 ***
*** Processing Page 27 ***
*** Processing Page 28 ***
*** Processing Page 29 ***
*** Processing Page 30 ***
*** Processing Page 31 ***
*** Processing Page 32 ***
*** Processing Page 33 ***
*** Processing Page 34 ***
*** Processing Page 35 ***
*** Processing Page 36 ***
*** Processing Page 37 ***
✅ Total na

# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
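Steps (1) to (4) can be sketched with the standard library alone. The snippet below is a minimal illustration, not the notebook's pipeline: `basic_clean` and the tiny `STOP_WORDS` set are hypothetical names introduced here (the notebook uses NLTK's English stopword list), and the sample sentence is modeled on the first bio in the dataset. Lowercasing is done first so that stopword matching is case-insensitive.

```python
import re

# Tiny illustrative stop list (an assumption); the notebook uses NLTK's English stopwords.
STOP_WORDS = {"in", "the", "of", "and", "a", "an", "to", "was", "her", "his"}

def basic_clean(text, stop_words=STOP_WORDS):
    text = text.lower()                        # step 4, done first so matching is case-insensitive
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # step 1: replace punctuation and special characters
    text = re.sub(r"\d+", " ", text)           # step 2: remove numbers
    tokens = text.split()                      # split() also collapses repeated whitespace
    return " ".join(t for t in tokens if t not in stop_words)  # step 3: drop stopwords

sample = "Nisei female. Born May 9, 1927, in Selleck, Washington."
cleaned_sample = basic_clean(sample)
print(cleaned_sample)
```

Replacing removed characters with a space and then re-joining the tokens avoids the double spaces that appear when characters are simply deleted.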

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Ensure required NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

# Load the dataset
df = pd.read_csv('narrators_data.csv')


print(df.head())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...


           Narrator_Name                                                Bio
0           Kay Aiko Abe  Nisei female. Born May 9, 1927, in Selleck, Wa...
1                Art Abe  Nisei male. Born June 12, 1921, in Seattle, Wa...
2  Sharon Tanagi Aburano  Nisei female. Born October 31, 1925, in Seattl...
3        Toshiko Aiboshi  Nisei female. Born July 8, 1928, in Boyle Heig...
4      Douglas L. Aihara  Sansei male. Born March 15, 1950, in Torrance,...


[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
def remove_noise(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Apply function to Bio column
df['Cleaned_Bio'] = df['Bio'].astype(str).apply(remove_noise)


print(df[['Bio', 'Cleaned_Bio']].head())


                                                 Bio  \
0  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Nisei female. Born October 31, 1925, in Seattl...   
3  Nisei female. Born July 8, 1928, in Boyle Heig...   
4  Sansei male. Born March 15, 1950, in Torrance,...   

                                         Cleaned_Bio  
0  Nisei female Born May   in Selleck Washington ...  
1  Nisei male Born June   in Seattle Washington G...  
2  Nisei female Born October   in Seattle Washing...  
3  Nisei female Born July   in Boyle Heights Cali...  
4  Sansei male Born March   in Torrance Californi...  


In [None]:
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Apply function
df['Cleaned_Bio'] = df['Cleaned_Bio'].apply(remove_numbers)


print(df[['Bio', 'Cleaned_Bio']].head())


                                                 Bio  \
0  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Nisei female. Born October 31, 1925, in Seattl...   
3  Nisei female. Born July 8, 1928, in Boyle Heig...   
4  Sansei male. Born March 15, 1950, in Torrance,...   

                                         Cleaned_Bio  
0  Nisei female Born May   in Selleck Washington ...  
1  Nisei male Born June   in Seattle Washington G...  
2  Nisei female Born October   in Seattle Washing...  
3  Nisei female Born July   in Boyle Heights Cali...  
4  Sansei male Born March   in Torrance Californi...  


In [None]:
import nltk
nltk.download('punkt')  # For tokenization
nltk.download('stopwords')  # For stop words list


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Build the stopword set once instead of on every call
stop_words = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(text):
    # Tokenize using split() and lowercase the text
    words = text.lower().split()
    return " ".join([word for word in words if word not in stop_words])


In [None]:
df['Cleaned_Bio'] = df['Cleaned_Bio'].str.lower()

# Show sample output
print(df[['Bio', 'Cleaned_Bio']].head())


                                                 Bio  \
0  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Nisei female. Born October 31, 1925, in Seattl...   
3  Nisei female. Born July 8, 1928, in Boyle Heig...   
4  Sansei male. Born March 15, 1950, in Torrance,...   

                                         Cleaned_Bio  
0  nisei female born may   in selleck washington ...  
1  nisei male born june   in seattle washington g...  
2  nisei female born october   in seattle washing...  
3  nisei female born july   in boyle heights cali...  
4  sansei male born march   in torrance californi...  


In [None]:
import nltk
nltk.download('punkt', download_dir='/root/nltk_data')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def apply_stemming(text):

    words = text.split()  # Split by whitespace
    return " ".join([stemmer.stem(word) for word in words])

# Apply function
df['Cleaned_Bio'] = df['Cleaned_Bio'].apply(apply_stemming)

# Show sample output
print(df[['Bio', 'Cleaned_Bio']].head())


                                                 Bio  \
0  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Nisei female. Born October 31, 1925, in Seattl...   
3  Nisei female. Born July 8, 1928, in Boyle Heig...   
4  Sansei male. Born March 15, 1950, in Torrance,...   

                                         Cleaned_Bio  
0  nisei femal born may in selleck washington spe...  
1  nisei male born june in seattl washington grew...  
2  nisei femal born octob in seattl washington fa...  
3  nisei femal born juli in boyl height californi...  
4  sansei male born march in torranc california g...  


In [None]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

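def apply_lemmatization(text):
    # Use split() instead of word_tokenize
    # Note: Cleaned_Bio was already stemmed in the previous step, so most tokens are
    # no longer dictionary words and the lemmatizer leaves them largely unchanged
    words = text.split()
    return " ".join([lemmatizer.lemmatize(word) for word in words])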

# Apply function
df['Cleaned_Bio'] = df['Cleaned_Bio'].apply(apply_lemmatization)

# Show sample output
print(df[['Bio', 'Cleaned_Bio']].head())


                                                 Bio  \
0  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Nisei female. Born October 31, 1925, in Seattl...   
3  Nisei female. Born July 8, 1928, in Boyle Heig...   
4  Sansei male. Born March 15, 1950, in Torrance,...   

                                         Cleaned_Bio  
0  nisei femal born may in selleck washington spe...  
1  nisei male born june in seattl washington grew...  
2  nisei femal born octob in seattl washington fa...  
3  nisei femal born juli in boyl height californi...  
4  sansei male born march in torranc california g...  


In [None]:
df.to_csv('cleaned_narrators_data.csv', index=False, encoding='utf-8')

print("✅ Data cleaning completed. Saved as 'cleaned_narrators_data.csv'")


✅ Data cleaning completed. Saved as 'cleaned_narrators_data.csv'


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
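The analysis cell in this notebook only prints a placeholder for constituency parsing. As a minimal sketch of what a constituency tree looks like, the example below parses a shortened version of the notebook's example sentence with a hand-written toy grammar, using NLTK's `CFG` and `ChartParser`. The grammar is an assumption made purely for illustration: it covers only this one sentence and is not a general English parser.

```python
import nltk

# Toy grammar written by hand for this one sentence (an assumption for illustration).
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> AUX V PP
PP -> P NP
NP -> 'Apple' | 'Steve' 'Jobs'
AUX -> 'was'
V -> 'founded'
P -> 'by'
""")

tokens = "Apple was founded by Steve Jobs".split()
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(tokens))

for tree in trees:
    print(tree)  # bracketed constituency tree
```

Each bracket groups words into a phrase (NP, VP, PP), whereas the dependency output printed with spaCy links each word directly to its head word.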

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
import spacy
import pandas as pd
from collections import Counter

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load the cleaned data from the CSV file
df = pd.read_csv('cleaned_narrators_data.csv')

# Extract the cleaned text from the 'Cleaned_Bio' column
cleaned_narrators_data = df['Cleaned_Bio'].iloc[0]

# Process the text using spaCy
doc = nlp(cleaned_narrators_data)

# 1. Parts of Speech (POS) Tagging
pos_counts = Counter()
for token in doc:
    pos_counts[token.pos_] += 1

# Print POS counts
print("Parts of Speech (POS) Counts:")
print(f"Nouns (NOUN): {pos_counts['NOUN']}")
print(f"Verbs (VERB): {pos_counts['VERB']}")
print(f"Adjectives (ADJ): {pos_counts['ADJ']}")
print(f"Adverbs (ADV): {pos_counts['ADV']}")

# 2. Constituency Parsing and Dependency Parsing
# Dependency Parsing
print("\nDependency Parsing Trees:")
for sent in doc.sents:
    print("\nSentence:", sent.text)
    for token in sent:
        print(f"Word: {token.text}, Dependency: {token.dep_}, Head: {token.head.text}")



# For each sentence, print the constituency tree (use Benepar after installation)
print("\nConstituency Parsing Trees:")
for sent in doc.sents:

    print("Constituency parsing tree can be visualized with Benepar library")

# 3. Named Entity Recognition (NER)
ner_counts = Counter()
for ent in doc.ents:
    ner_counts[ent.label_] += 1

# Print NER counts
print("\nNamed Entity Recognition (NER) Counts:")
for entity, count in ner_counts.items():
    print(f"{entity}: {count}")

# Example to explain Dependency Parsing and Constituency Parsing:
print("\nExplanation of Parsing Trees with One Example Sentence:")

example_sentence = "Apple was founded by Steve Jobs in Cupertino in 1976."

# Dependency Parsing of the example sentence
doc_example = nlp(example_sentence)
print("\nDependency Parsing of Example Sentence:")
for token in doc_example:
    print(f"Word: {token.text}, Dependency: {token.dep_}, Head: {token.head.text}")




Parts of Speech (POS) Counts:
Nouns (NOUN): 11
Verbs (VERB): 3
Adjectives (ADJ): 2
Adverbs (ADV): 0

Dependency Parsing Trees:

Sentence: nisei femal born may selleck washington spent much childhood beaverton oregon father own farm influenc earli age parent convers christian world war ii remov portland assembl center oregon minidoka concentr camp idaho war work establish success volunt program feed homeless seattl washington
Word: nisei, Dependency: compound, Head: femal
Word: femal, Dependency: nsubj, Head: selleck
Word: born, Dependency: acl, Head: femal
Word: may, Dependency: aux, Head: selleck
Word: selleck, Dependency: compound, Head: washington
Word: washington, Dependency: nsubj, Head: spent
Word: spent, Dependency: ROOT, Head: spent
Word: much, Dependency: amod, Head: beaverton
Word: childhood, Dependency: compound, Head: beaverton
Word: beaverton, Dependency: dobj, Head: spent
Word: oregon, Dependency: compound, Head: father
Word: father, Dependency: appos, Head: beaverton
Wor

# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. Extract the product name, a short description, and the URL of each product.

 Store the extracted data in a structured CSV file with columns for product name, description, URL, and page number. Introduce a time delay between requests to avoid overloading the server. ChatGPT can assist with parsing the HTML, handling errors, and generating reports from the collected data.

 Complete the scraping within a specified time limit, keeping the process efficient and within GitHub's usage guidelines.
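The pagination and throttling described above can be sketched as follows. This is a minimal outline under stated assumptions: `page_url` and `crawl` are hypothetical helpers, the `page` query parameter name is an assumption about how the listing paginates, and the `fetch` callable stands in for whatever function downloads and parses one page. The demo uses a stub instead of a real request, so it runs offline.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def page_url(base_url, page_number, param="page"):
    """Return base_url with a page query parameter set (parameter name is an assumption)."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query[param] = str(page_number)
    return urlunsplit(parts._replace(query=urlencode(query)))

def crawl(base_url, pages, fetch, delay_seconds=1.0):
    """Call fetch(url) for each page, pause between requests, and tag rows with the page number."""
    rows = []
    for page_number in pages:
        url = page_url(base_url, page_number)
        try:
            items = fetch(url)
        except OSError as err:  # urllib's URLError and HTTPError are subclasses of OSError
            print(f"page {page_number} failed: {err}")
            items = []
        rows.extend({"page": page_number, **item} for item in items)
        time.sleep(delay_seconds)  # throttle to avoid overloading the server
    return rows

# Offline demo with a stub fetcher (hypothetical data, no network access).
def stub_fetch(url):
    return [{"name": "demo", "url": url}]

rows = crawl("https://github.com/marketplace?type=actions", range(1, 3), stub_fetch, delay_seconds=0)
print(rows)
```

Keeping the download step behind a `fetch` argument makes the loop testable without touching the network, and the page number travels with each row so it can be written to the CSV.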

(PART-2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
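The data quality checks described above could look roughly like this in pandas. It is a sketch, not the notebook's implementation: `quality_report` and `clean` are hypothetical helper names, and the rows in `raw` are made-up stand-ins for scraped records.

```python
import pandas as pd

# Hypothetical rows standing in for scraped marketplace records.
raw = pd.DataFrame({
    "Product Name": ["GitHub Script", "GitHub Script", "FTP Deploy", None],
    "Product Info": ["Run simple scripts", "Run simple scripts", "  ", "No name"],
})

def quality_report(df):
    """Summarize row count, exact duplicate rows, and missing or blank cells per column."""
    text = df.fillna("").astype(str).apply(lambda col: col.str.strip())
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_or_blank": {col: int((text[col] == "").sum()) for col in df.columns},
    }

def clean(df):
    """Trim whitespace, drop rows with any empty field, then drop duplicate rows."""
    out = df.copy()
    for col in out.columns:
        out[col] = out[col].fillna("").astype(str).str.strip()
    out = out[(out != "").all(axis=1)]
    return out.drop_duplicates().reset_index(drop=True)

report = quality_report(raw)
cleaned = clean(raw)
print(report)
print(cleaned)
```

Running the report before and after cleaning gives a simple record of what was removed and why.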


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import HTTPError
import time
import pandas as pd

marketplace_url = "https://github.com/marketplace?type=actions"
total_pages = 500

# Create an empty list to store product data
product_data = []

for page_number in range(1, total_pages + 1):
    # Request a different page on each iteration
    # (assumes the listing accepts a `page` query parameter)
    page_url = f"{marketplace_url}&page={page_number}"
    request_link = Request(page_url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        response_url = urlopen(request_link)
    except HTTPError as err:
        print(f"Skipping page {page_number}: HTTP {err.code}")
        continue
    page_data = response_url.read()
    soup = BeautifulSoup(page_data, 'html.parser').find('div', class_="mt-4 marketplace-common-module__marketplace-list-grid--vCk7D")
    if soup is None:
        # The generated class names can change; skip pages where the grid is not found
        continue

    marketplace_items = soup.find_all('div', class_="position-relative border rounded-2 d-flex marketplace-common-module__marketplace-item--MohVH gap-3 p-3")

    for item in marketplace_items:
        product_name_container = item.find('div', class_="d-flex flex-justify-between flex-items-start gap-3")
        product_name = product_name_container.find('a').text
        product_info = item.find('p').text
        # Append product details to the product_data list
        product_data.append([product_name, product_info])
        print(product_name)
        print(product_info)

    time.sleep(1)  # Pause between requests to avoid overloading the server


df_products = pd.DataFrame(product_data, columns=["Product Name", "Product Info"])

# Save the DataFrame to a CSV file
csv_filename = "Github_marketplace_products.csv"
df_products.to_csv(csv_filename, index=False, encoding="utf-8")

print(f"CSV file '{csv_filename}' has been created successfully!")


Streaming output truncated to the last 5000 lines.

run-digger
Manage terraform collaboration

Rebuild Armbian
Build Armbian Linux

GitHub Script
Run simple scripts using the GitHub client

Deploy to GitHub Pages
This action will handle the deployment process of your project to GitHub Pages

ChatGPT CodeReviewer
A Code Review Action Powered By ChatGPT

FTP Deploy
Automate deploying websites and more with this GitHub action via FTP and FTPS

TruffleHog OSS
Scan Github Actions with TruffleHog

Metrics embed
An infographics generator with 40+ plugins and 300+ options to display stats about your GitHub account

yq - portable yaml processor
create, read, update, delete, merge, validate and do more with yaml

Super-Linter
Super-linter is a ready-to-run collection of linters and code analyzers, to help validate your source code

Gosec Security Checker
Runs the gosec security checker

Rebuild Armbian and Kernel
Support Amlogic, Rockchip and Allwinner boxes

OpenCommit — improve c

# Question 5 (20 points)

PART 1:
Collect tweets from Twitter using the Tweepy API, targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.
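A minimal sketch of the cleaning and quality-check steps for tweets, using only the standard library. `clean_tweet` and `quality_check` are hypothetical helpers, the field names follow the columns used later in this notebook, and the sample rows (including the short link) are made up for illustration.

```python
import re

def clean_tweet(text):
    """Strip links and @mentions, keep hashtag words, lowercase, and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    text = text.replace("#", " ")               # keep the hashtag word, drop the symbol
    return " ".join(text.lower().split())

def is_blank(value):
    return value is None or not str(value).strip()

def quality_check(rows, required=("tweet_id", "author_id", "text")):
    """Keep rows that have every required field filled and a tweet_id not seen before."""
    seen, kept = set(), []
    for row in rows:
        if any(is_blank(row.get(field)) for field in required):
            continue
        if row["tweet_id"] in seen:
            continue
        seen.add(row["tweet_id"])
        kept.append(row)
    return kept

# Hypothetical sample rows.
sample = [
    {"tweet_id": 1, "author_id": 10, "text": "Check: https://t.co/abc #AI #LLM @RootSignals  great"},
    {"tweet_id": 1, "author_id": 10, "text": "Check: https://t.co/abc #AI #LLM @RootSignals  great"},
    {"tweet_id": 2, "author_id": 11, "text": "   "},
]
cleaned_rows = [{**row, "clean_text": clean_tweet(row["text"])} for row in quality_check(sample)]
print(cleaned_rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(...)` and saved with `to_csv`.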


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [None]:
pip install tweepy



In [None]:
import os
import tweepy

# Keys and tokens are read from environment variables (the variable names below are placeholders)
# so that credentials are never stored in the notebook
api_key = os.environ['TWITTER_API_KEY']
api_key_secret = os.environ['TWITTER_API_KEY_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

# Authenticate with Twitter
auth = tweepy.OAuth1UserHandler(
    consumer_key=api_key,
    consumer_secret=api_key_secret,
    access_token=access_token,
    access_token_secret=access_token_secret
)
api = tweepy.API(auth)

In [None]:
import os
import tweepy

# Bearer token, read from an environment variable (placeholder name) rather than hard-coded
client = tweepy.Client(bearer_token=os.environ['TWITTER_BEARER_TOKEN'])


# I used the same query as explained in class; I have since exceeded my API limit and cannot run a new query
query = "#llm -is:retweet"
tweets = client.search_recent_tweets(query=query, tweet_fields=["created_at", "text", "author_id"], max_results=100)

# Printing tweets
for tweet in tweets.data:
    print(f"User: {tweet.author_id}, Tweet: {tweet.text}")

User: 1633336812105084932, Tweet: Dobby: The best LLM personality, outperforming SOTA models. Compare with Llama 70B: https://t.co/lcFI4d7mhX #AI #LLM for a couple of seconds
Dobby: the ultimate LLM personality—outshining SOTA. Thanks, @confident_ai. Compare: https://t.co/lcFI4d7mhX
#AI #LLM 

@SentientAGI
User: 7365972, Tweet: sn-news: #open #ai #ml #llm Researchers open source Sky-T1, a ‘reasoning’ AI model that can be trained for less than $450 https://t.co/0SpdH3rCVY
User: 1574016451006107648, Tweet: Introducing Root Judge, a powerful mid-sized LLM that enables reliable and customizable LLM system evaluations. Trained on 384 AMD Radeon Instinct™ MI250X GPUs using the LUMI Supercomputer. #LLM #LLMs #evaluation #llm_as_a_judge #llama @RootSignals
Check: https://t.co/VBiCck3jM8 https://t.co/UhOXzSwjD5
User: 109488608, Tweet: El lado del mal - Perplexity Pro Deep Research: Gratis para todos los clientes de Movistar https://t.co/g6UjCKsysq #Perplexity #Movistar #Telefónica #IA #AI #GenA

In [None]:
import pandas as pd

# list to store tweet data
tweet_data = []

if tweets.data:
    for tweet in tweets.data:
        tweet_data.append({
            'tweet_id': tweet.id,
            'author_id': tweet.author_id,
            'created_at': tweet.created_at,
            'text': tweet.text
        })


df = pd.DataFrame(tweet_data)


print(df)


df.to_csv("tweets.csv", index=False)

               tweet_id            author_id                created_at  \
0   1892095836839956836  1633336812105084932 2025-02-19 06:16:15+00:00   
1   1892093508346880020              7365972 2025-02-19 06:07:00+00:00   
2   1892091762023317995  1574016451006107648 2025-02-19 06:00:04+00:00   
3   1892090363415756880            109488608 2025-02-19 05:54:30+00:00   
4   1892089823545974850  1890662331782426625 2025-02-19 05:52:21+00:00   
..                  ...                  ...                       ...   
95  1891900670401294391           3559557437 2025-02-18 17:20:44+00:00   
96  1891898788635345362             19656981 2025-02-18 17:13:15+00:00   
97  1891895544622014733            361631391 2025-02-18 17:00:22+00:00   
98  1891893520564355405  1719074185346211840 2025-02-18 16:52:19+00:00   
99  1891892174804889706           2863292385 2025-02-18 16:46:58+00:00   

                                                 text  
0   Dobby: The best LLM personality, outperforming...  

# Mandatory question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

I was able to apply my data science skills, particularly text preprocessing like stemming and lemmatization, thanks to this interesting assignment.

Challenges:

Managing missing resources, such as the NLTK punkt tokenizer, was the biggest challenge and needed a lot of effort to get through.

Features I Enjoyed:

Getting the data cleaned and converted successfully was satisfying.

Time Taken to Complete:

Although getting past the NLTK issues took longer than anticipated, the given time was mostly sufficient.



# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog