# Prepare the complete data set

This notebook integrates the contents of the urls from "keywords_emptyText.csv" into the data set with all urls

Verizon, Group 41
<br>Athena Bai, Tia Zheng, Kathy Yang, Tapuwa Kabaira, Chris Smith

Last updated: Nov. 28, 2024

In [1]:
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
#import seaborn as sns

In [2]:
# Read data files
working_urls = pd.read_csv("df_text.csv", header=0)
remaining_urls = pd.read_csv("remaining_contents.csv", header=0)
labels = pd.read_csv("categorizedurls.csv", header=0)

In [3]:
# Inspect the columns of the two dataframes
print(working_urls.columns.values)
print(remaining_urls.columns.values)

['url' ' category' 'text_content' 'Text_Length' 'text_cleaned' 'Sentiment']
['url' 'content']


In [4]:
# Make a copy of each DataFrame
working_urls = working_urls.copy()
remaining_urls = remaining_urls.copy()
labels = labels.copy()

In [5]:
working_urls.drop(columns='text_cleaned', inplace=True)

In [6]:
remaining_urls.columns.values

array(['url', 'content'], dtype=object)

In [7]:
# Find the content of 'zillow.com', which contains a recurring error message
print(remaining_urls.loc[remaining_urls['url'] == 'zillow.com', 'content'].iloc[0])

Access to this page has been denied 按住以确认您是人类 （而非机器人）。 有问题吗？ 参考ID 3055a895-ad61-11ef-87f4-aae8c35e1003 报告问题 此页面遇到问题？请告诉我们： 您可以联系我们以寻求帮助。您应该使用参考ID : 3055a895-ad61-11ef-87f4-aae8c35e1003 您也可以向我们发送您的反馈： I’m a bot 我不知道在哪里进行确认 我一直收到“请再试一次”的消息 其他（请在下面详细说明） 遇到其他问题？ 取消 发送 ✓ 感谢您的反馈


In [8]:
# Find the content of 'priceline.com', which contains a recurring error message
print(remaining_urls.loc[remaining_urls['url'] == 'priceline.com', 'content'].iloc[0])

Access to this page has been denied. Priceline Security check: Please confirm that you are a real Priceline user. Access to this page has been denied because we believe you are using automation tools to browse the website. This may happen as a result of the following:  is disabled or blocked by an extension (ad blockers for example) Your browser does not support  Please make sure that  and  are enabled on your browser and that you are not blocking them from loading. Reference ID: #2c277113-ad63-11ef-aa3f-8b2e1d612e0b


In [9]:
error_message = []
error_message.append('Access to this page has been denied')
error_message.append(' 按住以确认您是人类 （而非机器人）。 有问题吗？ 参考ID ')
error_message.append('报告问题 此页面遇到问题？请告诉我们： 您可以联系我们以寻求帮助。您应该使用参考ID :')
error_message.append('您也可以向我们发送您的反馈： I’m a bot 我不知道在哪里进行确认 我一直收到“请再试一次”的消息 其他（请在下面详细说明） 遇到其他问题？ 取消 发送 ✓ 感谢您的反馈')
error_message.append('is disabled or blocked by an extension')

In [10]:
# # Translation test code
# from deep_translator import GoogleTranslator
# to_translate = 'あなたは老师啊'
# translated = GoogleTranslator(source='auto', target='en').translate(to_translate)
# print(translated)

In [11]:
# Functions to process the texts in remaining_urls 
from deep_translator import GoogleTranslator
from langdetect import detect
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import RegexpTokenizer
import time

def remove_error_messages(content):
    for message in error_message:
        content = content.replace(message, "")
    return content
    
# Translate non-English texts in remaining_urls to English
def preprocess(content):
    if not isinstance(content, str):  # Check if the content is not a string (e.g. None)
        return content
    remove_error_messages(content)
    max_chars = 5000
    # Don't translate if the text exceeds max_chars for GoogleTranslator
    # because the two long texts that cause errors only have English 
    if (len(content) < max_chars) and (detect(content) != 'en'):
        try:
            return GoogleTranslator(source='auto', target='en').translate(content)
        except Exception as e:
            print(f"Translation failed: {e}")
            print(content)
            return content
    return content

# From model.py (by Tia)
sia = SentimentIntensityAnalyzer()
def calc_sentiment(text):
    return sia.polarity_scores(text)['compound'] if isinstance(text, str) else 0


# For feature 'lexical_diversity'

# # From model.py (by Tia)
# tokenizer = RegexpTokenizer(r'\w+')
# def tokenize_text(text):
#     return tokenizer.tokenize(text) if isinstance(text, str) else []

# # From model.py (by Tia)
# def lexical_diversity(text):
#     tokens = tokenize_text(text)
#     return len(set(tokens)) / len(tokens) if len(tokens) > 0 else 0

In [12]:
# Add features to the remaining urls
import time

remaining_urls['category'] = None
remaining_urls.rename(columns={'content': 'text_content'}, inplace=True)
remaining_urls['Text_Length'] = remaining_urls['text_content'].str.len().fillna(0)
for index, row in remaining_urls.iterrows():
    content = row['text_content']
    processed = preprocess(content)
    remaining_urls.at[index, 'text_content'] = processed
    time.sleep(3) # Introduce a delay of 3 seconds
remaining_urls['Sentiment'] = remaining_urls['text_content'].apply(calc_sentiment)
# remaining_urls['lexical_diversity'] = remaining_urls['text_content'].apply(lexical_diversity)

Translation failed: Request exception can happen due to an api connection error. Please check your connection and try again


In [13]:
# Inspect remaining_urls again
remaining_urls.head(10)

Unnamed: 0,url,text_content,category,Text_Length,Sentiment
0,facebook.com,Facebook - log in or sign up Connect with frie...,,667.0,0.8957
1,unpkg.com,"UNPKG UNPKG unpkg is a fast, global content de...",,3074.0,0.9689
2,paypalobjects.com,403 403 That’s an error.,,24.0,-0.4019
3,sentry.io,Application Performance Monitoring & Error Tra...,,10761.0,0.983
4,chase.com,"Credit Card, Mortgage, Banking, Auto | Chase O...",,1785.0,0.9921
5,zillow.com,Access to this page has been denied Press and ...,,273.0,-0.1838
6,wellsfargo.com,Wells Fargo Bank | Financial Services & Online...,,10353.0,0.9985
7,samsung.com,Samsung US | Mobile | TV | Home Electronics | ...,,23648.0,0.9998
8,pinterest.com,Pinterest Oh no! Pinterest doesn't work unless...,,60.0,-0.3595
9,cloudflare.com,"Connect, protect, and build everywhere | Cloud...",,7987.0,0.9988


In [2]:
from charset_normalizer import detect

# Open the file in binary mode to detect encoding
with open("tia-nltkmodel/data.csv", "rb") as file:
    result = detect(file.read())
    detected_encoding = result['encoding']

In [8]:
print(detected_encoding)

windows-1250


In [11]:
import pandas as pd
real_working_urls = pd.read_csv("tia-nltkmodel/data.csv", header=0, encoding=detected_encoding)
real_working_urls.head(10)

  real_working_urls = pd.read_csv("tia-nltkmodel/data.csv", header=0, encoding=detected_encoding)


Unnamed: 0,url,category,text_content,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 946,Unnamed: 947,Unnamed: 948,Unnamed: 949,Unnamed: 950,Unnamed: 951,Unnamed: 952,Unnamed: 953,Unnamed: 954,Unnamed: 955
0,google.com,Search Engines,Google About Store Gmail Images Sign in See mo...,,,,,,,,...,,,,,,,,,,
1,googleapis.com,Content Delivery Networks,,,,,,,,,...,,,,,,,,,,
2,apple.com,Computer and Internet Info,Apple Apple Apple Intelligence is??here. Exper...,,,,,,,,...,,,,,,,,,,
3,icloud.com,Online Storage and Backup,iCloud />,,,,,,,,...,,,,,,,,,,
4,facebook.com,Social Networking,Facebook - log in or sign up Connect with frie...,,,,,,,,...,,,,,,,,,,
5,youtube.com,Streaming Media,YouTube About Press Copyright Contact us Creat...,,,,,,,,...,,,,,,,,,,
6,googletagservices.com,Web Advertisements,,,,,,,,,...,,,,,,,,,,
7,amazon.com,Shopping,Amazon.com. Spend less. Smile more. Skip to ma...,,,,,,,,...,,,,,,,,,,
8,sc-static.net,Content Delivery Networks,,,,,,,,,...,,,,,,,,,,
9,t.co,Internet Communications and Telephony,t.co / Twitter Twitter uses the t.co domain as...,,,,,,,,...,,,,,,,,,,


In [21]:
# Concatenate data on the two parts of urls
df_combined = pd.concat([working_urls, remaining_urls], ignore_index=True)

(1136, 6)


In [None]:
# Merge with the given data (all urls and labels)
# to ensure the correct order based on 'url' and fill the 'category' column
complete_data = labels[['url', 'category']].merge(
    df_combined, on=['url', 'category'], how='left'
)

In [18]:
# Reexamine the content of 'zillow.com', which contained non-English text
print(remaining_urls.loc[remaining_urls['url'] == 'zillow.com', 'text_content'].iloc[0])

Access to this page has been denied Press and hold to confirm you are a human (and not a bot). Having problems? Reference ID 3055a895-ad61-11ef-87f4-aae8c35e1003 Report a problem Having problems with this page? Let us know: You can contact us for help. You should use the reference ID: 3055a895-ad61-11ef-87f4-aae8c35e1003 You can also send us your feedback: I’m a bot I don’t know where to check I keep getting the “Please try again” message Other (please specify below) Having another problem? Cancel Send ✓ Thanks for your feedback
