# iSleuth

## 1. Preamble

With the rise of the internet, people increasingly leave feedback, reviews and commentary on a wide variety of day-to-day products and experiences, smartphones among them. However, the sheer volume of this data poses a problem for the keen designer: it comes in varying levels of detail, grammar and language, and takes time to gather, process and distill into proper design prompts.

In recent years, Artificial Intelligence has made it possible to leverage the speed and processing power of computers for data collection and processing. This means that with well-designed web scrapers (programs that scour the internet and collect data) and Artificial Intelligence to categorize, process and draw meaning from large amounts of data, designers can extract detailed prompts to guide their design thinking. This provides a unique opportunity for designers to leverage big data in their design processes, developing products that better meet the needs and demands of their markets.
This project aims to implement one such system for phone design. This report details the steps taken to design the algorithm and then discusses its strengths and drawbacks.


### 1.1 Deciding on the product

The iPhone, which was first introduced in 2007, is one of Apple's most significant products and has had a profound impact on the smartphone market. The iPhone is a line of smartphones that combines a mobile phone, an iPod, and an Internet communication device into a single device with a touchscreen interface. The original iPhone introduced a revolutionary design with a full touch screen, a sleek form factor, and an intuitive user interface that set a new standard for smartphones.
As Apple is a leading innovator in smartphone design, each iteration of the iPhone attracts a large amount of speculation and critique. This provides a rich testing ground for our algorithm, and the product line also stands to benefit greatly from such a tool, so we chose to conduct our research on this particular series of phones.


## 2. The Setup

### 2.1 Installing libraries

This project uses a lot of libraries as well as classification and text models.

I do not recommend running all the code on your own computer, as it will download all of these models locally. While I list the code for the classifiers in this notebook, I ran them on Google Colab and would recommend doing the same.

Below are several libraries that were used in the project.

In [1]:
# !pip install amazon_search_results_scraper
# !pip install beautifulsoup4
# !pip install selenium
# !pip install clean-text
# !pip install langdetect
# !pip install google-api-python-client
# !pip install youtube-transcript-api
# !pip install praw

### 2.2 Creating folders

Many of these functions were run at different times: some simultaneously, some out of order, and some repeatedly while testing.

To prevent the data from becoming too messy, a set of folders was created to store the output of the various functions.

Data: The main folder; stores the initial data from web scraping the various websites.

First_pass: Stores the data after it has been labelled by the first zero-shot classifier for the purposes of data cleaning.

Second_pass: Stores the data after it has passed through the second classifier and been labelled for classification.

Final: Stores the cleaned data from the different platforms after it has been compiled together and labelled with sentiment analysis.

The function <b> create_directories() </b> takes no input and creates the directories used to store the data and the various CSV files.

In [2]:
import os
def create_directories():
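    # Create the base folder; os.makedirs raises FileExistsError if it is already there
    try:
        os.makedirs("Data")
        print("Base directory created")
    except FileExistsError:
        print("base directory: Data exists")
    # Create one subfolder per processing stage inside Data
    folders = ['Final', 'First_pass', 'Second_pass']
    for i in folders:
        path = 'Data'
        try:
            os.makedirs(os.path.join(path, i))
        except FileExistsError:
            print("Folder " + i + " exists.")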

create_directories()

Base directory created
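
The same folder layout can also be created idempotently, so that re-running the cell needs no `try`/`except`. The following is a minimal sketch, assuming the folder names listed above; the helper name `ensure_directories` and its parameters are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def ensure_directories(base="Data", subfolders=("First_pass", "Second_pass", "Final")):
    """Create the base folder and its subfolders; safe to call repeatedly."""
    base_path = Path(base)
    created = []
    for folder in subfolders:
        path = base_path / folder
        # parents=True also creates the base folder; exist_ok=True ignores existing folders
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_directories()` twice in a row should leave the folders untouched the second time rather than raising an error.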


## 3.  Pre-analysis

### 3.0 Inputs

The specific product is the only input needed for this step.

In [None]:
product = 'iphone'

### 3.1 Forming the categories to classify comments into

Before we get started, we need to know more about the existing product we are trying to improve. 

This will allow us to create categories specific to the product we are trying to analyse, so that the scraped comments can then be classified properly.

#### 3.1.1 Using openAI to get the strengths, flaws and competitors of the product

The first step was to use the openai library to find the strengths and the flaws of the product. We use specific prompts to ensure that the response is in a format that can be converted to a Python dictionary.

The competitors were also collected for use in data cleaning in the subsequent steps.

The function <b> ask_gpt(prompt) </b> is a general function that takes a prompt, sends it to the OpenAI API and returns the generated answer as a string.

This function is subsequently used in the functions <b> get_design_strength(keyword) </b>, <b> get_design_flaws(keyword) </b>, and <b> get_competitors(keyword) </b>. Each takes the product name (for example 'iphone') and returns the response to its prompt as a string containing a dictionary.

It should be noted that these functions were run with the more specific keyword 'iphone 11' rather than 'iphone'. As all the web scraping had already been done by then, there was not enough time to rescrape all the websites.

In [5]:
import openai
openai.api_key = "Insert your own api key here"

def ask_gpt(prompt):
    # Uses the legacy Completions endpoint of the openai library (versions before 1.0)
    response = openai.Completion.create(
        engine="text-davinci-003", prompt=prompt, max_tokens=1024, n=1, stop=None, temperature=0.1
    )
    return response.choices[0].text.strip()


# The spaces around the keyword keep it from running into the neighbouring words of the prompt
def get_design_strength(keyword):
    return ask_gpt("Give me the design strengths of the " + keyword + " in a python dictionary format")

def get_design_flaws(keyword):
    return ask_gpt("Give me the design flaws of the " + keyword + " in a python dictionary format")

def get_competitors(keyword):
    return ask_gpt("Give me the competitors of the " + keyword + " in a python dictionary format")

In [None]:
response2 = get_design_flaws("iphone 11")
response3 = get_design_strength("iphone 11")
response4 = get_competitors("iphone 11")
print(response2)
print(response3)
print(response4)

The response from OpenAI is a string, so the ast library is used to convert it into a Python dictionary.

The function <b> convert_string_to_dict(string) </b> takes the string returned from OpenAI and converts it into a dictionary. Note that the string must contain a valid Python dictionary literal, optionally preceded by a variable assignment ending in `=`, for the conversion to work.

In [None]:
import ast

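def convert_string_to_dict(string):
    # Drop a leading "name =" assignment if the response contains one
    try:
        index = string.index("=")
        string = string[index+1:].strip()
    except ValueError:
        # no "=" in the response
        pass
    # literal_eval only parses Python literals, so no arbitrary code is executed
    return ast.literal_eval(string)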

design_flaws =(convert_string_to_dict(response2))
design_strengths = (convert_string_to_dict(response3))
competitors = (convert_string_to_dict(response4))
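
Because a language model does not always return exactly the requested format, a slightly more defensive parser can be useful. The following is a minimal sketch, not the method used in this project: the helper name `extract_dict` and the example string are hypothetical, and it assumes the response contains at most one dictionary literal.

```python
import ast

def extract_dict(text):
    """Return the {...} literal found in text as a dict, or None if parsing fails."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        # literal_eval accepts only Python literals, never arbitrary expressions
        value = ast.literal_eval(text[start:end + 1])
    except (ValueError, SyntaxError):
        return None
    return value if isinstance(value, dict) else None

# Hypothetical response, shaped like the output shown in this section
example = "design_flaws = {'Size': 'Too large for some users', 'Price': 'Expensive'}"
print(extract_dict(example))
```

Returning `None` instead of raising lets the caller decide whether to retry the prompt or skip the response.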

This is the output from running the code:

In [None]:
design_flaws = {
    'Size': 'Too large for some users',
    'Battery Life': 'Shorter than expected',
    'Price': 'Expensive for the features offered',
    'Camera': 'Not as good as other flagship phones',
    'Storage': 'Limited to 64GB or 256GB'
}
design_strengths ={
    'Design': 'Sleek and modern',
    'Durability': 'Highly durable',
    'Display': 'High-resolution Retina display',
    'Camera': 'Dual-lens camera system',
    'Battery': 'Long-lasting battery life',
    'Water Resistance': 'IP68 water resistance rating',
    'Wireless Charging': 'Supports wireless charging'
}
competitors = {
    "Samsung Galaxy S20": "Samsung",
    "Google Pixel 4": "Google",
    "OnePlus 8 Pro": "OnePlus",
    "Huawei P40 Pro": "Huawei"
}

#### 3.1.2 Scrape google shopping to get categories 

Next, we scrape the Google Shopping website. Google Shopping sorts reviews into distinct categories, which can be collected and used as potential categories.

Insert image of google shopping here

The functions <b> googleshopping_get_id </b> and <b> googleshopping_get_keywords </b> were used to scrape Google Shopping for these keywords.

The function <b> googleshopping_get_id </b> is used first. It takes a list of search terms, searches Google Shopping for them and returns a list of product IDs to look at.

The function <b> googleshopping_get_keywords </b> then takes the product IDs and the search terms, scrapes the review page of each product and returns all the keywords in a dataframe.

The function <b> save_data(data, file_name) </b> then takes the dataframe returned from <b> googleshopping_get_keywords </b> and stores it in the <b> Data </b> folder under the file name <b> google_shopping_scraped.csv </b>.

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
import pandas as pd


def save_data(data, file_name):
    # Make sure the Data folder exists before writing the CSV into it
    try:
        os.makedirs("Data")
        print("Directory created")

    except FileExistsError:
        print("Directory exists")

    data.to_csv("Data/%s.csv" % (file_name))

def insearch_result(search_term, title):
    title = title.lower()
    for keywords in search_term:
        if keywords not in title:
            # print(title)
            return False
    return True

def googleshopping_get_id(search_terms):
    base_url = "https://shopping.google.com/"

    # initialising chrome and going to the url
    driver = webdriver.Chrome("chromedriver.exe")
    driver.get(base_url)

    # finding search bar and searching for search_terms
    search = driver.find_element("xpath", "//*[@id='REsRA']")
    # print("The input Element is: ", search)
    search.send_keys(" ".join(search_terms))
    search.send_keys(Keys.RETURN)

    # pausing to let the page load and scrape the page html
    time.sleep(5)
    page = driver.page_source

    # close the driver
    time.sleep(2)
    driver.close()

    # parse the page
    soup = BeautifulSoup(page, "html.parser")
    product_id = []
    for links in soup.find_all('a', href=True):
        if "/shopping/product" in links['href']:
            end_index = links['href'].index("?")
            id = links['href'][18:end_index]

            if id not in product_id:
                product_id.append(id)

    return product_id


def googleshopping_get_keywords(productID, search_terms):
    # Require a list of ProductID
    base_url = "https://www.google.com/shopping/product/"
    df = pd.DataFrame(columns=["Title", "Number of Reviews",
                      "Keywords", "Percentage", "Positive/Negative"])

    for i in range(len(productID)):

        if "/offers" in productID[i]:
            index = productID[i].index("/offers")
            productID[i] = productID[i][:index]

        url = base_url + productID[i] + "/reviews"
        # print(url)

        driver = webdriver.Chrome("chromedriver.exe")
        driver.get(url)
        time.sleep(5)
        page = driver.page_source
        time.sleep(2)
        driver.close()

        soup = BeautifulSoup(page, 'html.parser')

        title = soup.find("div", {"class": "f0t7kf"}).text
        anyreviews = soup.find("div", {"class": "rktlcd"})

        if anyreviews is None:
            # print("No reviews found")
            pass

        else:
            intitle = insearch_result(search_terms, title)

            if intitle:
                for span in soup.find_all("span", "QIrs8"):
                    text = span.text
                    # print(text)
                    if text == "Select to view all reviews":
                        pass

                    else:
                        # print(text)
                        reviews_index = text.index("r")
                        about_index = text.index("about")
                        full_stop_index = text.index(".")
                        percentage_index = text.index("%")

                        num_of_reviews = text[4: reviews_index].strip()
                        keywords = text[about_index +
                                        len("about")+1: full_stop_index]
                        percentage = text[full_stop_index+2: percentage_index]
                        positivenegative = text[-9:-1]
                        # print(num_of_reviews, keywords,
                        #       percentage, positivenegative)
                        # DataFrame.append was removed in pandas 2.0, so the row is added by label;
                        # the values follow the column order defined at the top of the function
                        df.loc[len(df)] = [title, num_of_reviews, keywords,
                                           percentage, positivenegative]
                        # print(df)
    return df

search_terms = ['iphone']
product_ids = googleshopping_get_id(search_terms)
data = googleshopping_get_keywords(product_ids, search_terms)
save_data(data, "google_shopping_scraped")


The extracted CSV can be found in the Data folder under the name google_shopping_scraped.csv.

We then extract the keywords from the CSV, making sure to exclude identical tags.

The function <b> extract_keywords_from_google_shopping_csv(file_location) </b> takes the location where the Google Shopping CSV is stored, extracts the keywords and returns them as a list.

In [None]:
import pandas as pd

def extract_keywords_from_google_shopping_csv(file_location):
    keywords = []
    google_pd = pd.read_csv(file_location)
    for index, row in google_pd.iterrows():
        # Look the column up by name rather than by position
        keyword = row["Keywords"]
        if keyword not in keywords:
            keywords.append(keyword)
    return keywords

# A forward slash avoids the backslash being read as the start of an escape sequence
google_keywords = extract_keywords_from_google_shopping_csv("Data/google_shopping_scraped.csv")
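
The keyword extraction described above can also be expressed with pandas alone. The following is a minimal sketch, assuming the column layout written by `save_data`; the sample rows are made up for illustration and are not real scraped data.

```python
import io
import pandas as pd

# Hypothetical sample rows in the same column layout that save_data() writes
sample_csv = io.StringIO(
    ",Title,Number of Reviews,Keywords,Percentage,Positive/Negative\n"
    "0,Phone A,120,Battery life,85,positive\n"
    "1,Phone A,45,Camera quality,90,positive\n"
    "2,Phone B,60,Battery life,70,positive\n"
)

frame = pd.read_csv(sample_csv, index_col=0)
# drop_duplicates keeps the first occurrence, so the original order is preserved
unique_keywords = frame["Keywords"].drop_duplicates().tolist()
print(unique_keywords)  # ['Battery life', 'Camera quality']
```

Selecting the column by name keeps the code working even if the column order of the CSV changes.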

#### 3.1.3 Combine all keywords found (removing repeats) and use siamese network to remove similar categories

Next, we combine the keywords found so far. The first step is to create one large list of all the keywords.

The function <b> combine_keywords(strengths: list, flaws: list, google_keywords: list) </b> takes the lists of strengths, flaws and Google keywords generated by the previous functions and returns a single list of the compiled keywords.

In [None]:
def combine_keywords(strengths: list, flaws: list, google_keywords: list):
    keywords = []
    # list.append returns None, so chaining .lower() onto it would raise an AttributeError.
    # The original capitalisation is kept here; lowercasing is done later in create_keywords.
    for strength in strengths:
        if strength not in keywords:
            keywords.append(strength)
    for flaw in flaws:
        if flaw not in keywords:
            keywords.append(flaw)
    for google in google_keywords:
        if google not in keywords:
            keywords.append(google)
    return keywords

All_keywords = combine_keywords(design_flaws.keys(), design_strengths.keys(), google_keywords)

This is the output from combine_keywords:

In [7]:
All_keywords = ['Size', 'Battery', 'Price', 'Camera', 'Durability', 'Design', 'Display', 'Water Resistance', 'Wireless Charging', 'Long battery life', 'Easy to use', 'Quality camera', 'Ease of setup', 'Weight', 'Attractive', 'Quality display', 'Sound quality', 'Charging speed', 'Design comfort', 'Easy to set up', 'Good sound quality', 'Comfortable to use', 'Charges quickly', 'Build quality', 'Craftsmanship', 'Lightweight', 'Ease of use', 'Camera quality', 'Battery life', 'Visual appeal', 'Display quality', 'Durable', 'Heavy', 'Speed', 'Noise level', 'Temperature control', 'Lacks durability', 'Poor sound quality', 'Minimal glare']
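
An exact-match check lets through repeats that differ only in capitalisation or surrounding whitespace. The following is a minimal sketch of a case-insensitive merge; the helper name `merge_unique` and the sample lists are hypothetical and were not part of the original run.

```python
def merge_unique(*groups):
    """Merge several iterables of keywords, dropping case-insensitive repeats."""
    seen = set()
    merged = []
    for group in groups:
        for keyword in group:
            key = keyword.strip().lower()
            # Keep the first spelling encountered and skip later variants
            if key and key not in seen:
                seen.add(key)
                merged.append(keyword.strip())
    return merged

print(merge_unique(["Camera", "Battery"], ["camera", "Price"], ["Battery life"]))
# ['Camera', 'Battery', 'Price', 'Battery life']
```

Using a set for the membership test also avoids rescanning the growing list for every keyword.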

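However, certain keywords were observed to be quite similar. A similarity analysis, referred to here as a siamese analysis, was therefore used to filter them: each keyword is converted to a vector by summing the GloVe word vectors of its words, pairs of keywords are compared by cosine similarity, and only keywords that appear in at least one pair with a similarity below 0.2 are kept.

<b> This code was run on Google Colab. </b>

The function <b> siamese_analysis(All_keywords) </b> takes the list of all the keywords from the previous steps and returns the list of keywords that remain after this analysis.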

In [None]:
# Run this cell on Google Colab
from scipy import spatial
import gensim.downloader as api
import numpy as np

def siamese_analysis(All_keywords):
  model = api.load("glove-wiki-gigaword-50") #choose from multiple models https://github.com/RaRe-Technologies/gensim-data

  keywords = All_keywords
  def preprocess(s):
    return [i.lower() for i in s.split()]

  def get_vector(s):
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)

  vector_data = []
  for keyword in keywords:
    print(keyword)
    
    value = get_vector(keyword)
    vector_data.append(value)
    

  cleaned_keywords = []

  for i in range(len(vector_data)):
    for j in range(i, len(vector_data)):
      similarity = 1 - spatial.distance.cosine(vector_data[i], vector_data[j])
      if similarity < 0.2:
        cleaned_keywords.append(keywords[i])
        cleaned_keywords.append(keywords[j])
        # print(keywords[i], keywords[j])


  cleaned_keywords = list(set(cleaned_keywords))
  return cleaned_keywords

keywords = All_keywords
cleaned_keywords = siamese_analysis(keywords)
# print(cleaned_keywords)
print(len(keywords), len(cleaned_keywords))
print(cleaned_keywords)

These are the cleaned keywords from the run on the platform:

In [4]:
cleaned_keywords = ['Speed', 'Easy to set up', 'Camera', 'Comfortable to use', 'Wireless Charging', 'Charges quickly', 'Lightweight', 'Durability', 'Easy to use', 'Craftsmanship', 'Water Resistance', 'Heavy', 'Battery', 'Durable', 'Battery life', 'Price', 'Temperature control', 'Charging speed', 'Long battery life']
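
To make the similarity measure concrete, the following is a minimal offline sketch of the same idea: phrases are turned into vectors by summing word vectors and then compared by cosine similarity. The three-dimensional `toy_vectors` are invented for illustration and stand in for the GloVe model, so the numbers carry no meaning beyond this example.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "battery": np.array([0.9, 0.1, 0.0]),
    "life": np.array([0.7, 0.3, 0.1]),
    "long": np.array([0.6, 0.2, 0.2]),
    "price": np.array([0.0, 0.1, 0.9]),
}

def phrase_vector(phrase):
    """Sum the vectors of the words in a phrase (same idea as get_vector)."""
    return np.sum([toy_vectors[word] for word in phrase.lower().split()], axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similar = cosine_similarity(phrase_vector("Battery life"), phrase_vector("Long battery life"))
different = cosine_similarity(phrase_vector("Battery life"), phrase_vector("Price"))
print(round(similar, 3), round(different, 3))
```

With these toy vectors the first pair scores close to 1 and the second falls below the 0.2 threshold used in `siamese_analysis`. With a real embedding model, words missing from the vocabulary would need separate handling.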

#### 3.1.4 Remove categories that are substring of other categories

A few keywords were still substrings of other keywords, such as 'battery life' and 'long battery life'.

The function <b> create_keywords </b> takes the cleaned keywords from the previous step and, whenever one keyword is a substring of another, removes the shorter one. It then returns the list of final keywords, which is stored in the variable <b> keywords </b>.

In [None]:
def create_keywords(cleaned_keywords):
    # Lowercase, drop exact repeats (keeping the first occurrence) and sort shortest first
    lowered = sorted(dict.fromkeys(k.lower() for k in cleaned_keywords), key=len)
    print(lowered)

    # Keep a keyword only if it is not a substring of another keyword.
    # Building a new list avoids removing items from a list while looping over it,
    # which would skip the element that follows each removal.
    return [k for k in lowered
            if not any(k != other and k in other for other in lowered)]

keywords = create_keywords(cleaned_keywords)

The following are the keywords that were generated in the initial run of the code:

In [None]:
keywords = ['heavy', 'price', 'camera', 'durable', 'durability', 'lightweight', 'easy to use', 'craftsmanship', 'easy to set up', 'charging speed', 'charges quickly', 'water resistance', 'wireless charging', 'long battery life', 'comfortable to use', 'temperature control']

### 3.2 Output

With the product as the only input, we are able to fully automate the process of producing the following outputs:

<b> google_shopping_scraped.csv </b>: A CSV saved in the <b> Data </b> folder with the categories and the number of reviews under each category.

<b>product </b>: A string with the name of the product that is the focus of the study. For this report, the product is 'iphone'.

<b>keywords </b>: A list of keywords that can be used for classification with zero-shot or other text analysis models.

<b>competitors </b>: A dictionary of competing products and companies that can be used for data screening.

The outputs from this run are as follows:

In [10]:
keywords = ['heavy', 'price', 'camera', 'durable', 'durability', 'lightweight', 'easy to use', 'craftsmanship', 'easy to set up', 'charging speed', 'charges quickly', 'water resistance', 'wireless charging', 'long battery life', 'comfortable to use', 'temperature control']

competitors = {
    "Samsung Galaxy S20": "Samsung",
    "Google Pixel 4": "Google",
    "OnePlus 8 Pro": "OnePlus",
    "Huawei P40 Pro": "Huawei"
}

## 4.  Big Data collection

### 4.0 Inputs

As in step 3, we are still collecting data, and the product name is the only input needed (for the general websites).

In [None]:
product = 'iphone'

### 4.1 Choice of websites

A wide range of websites was chosen to collect the dataset used to analyse ways of improving the product.

Given that the product would usually be unknown in advance, a few generic websites were chosen on which any product could appear. This led to the following websites being chosen:

<b>Amazon </b>: A popular e-commerce website used for a wide range of products. It should be noted that Amazon has an anti-scraping policy. However, this project was done for educational purposes rather than monetization, solely as a learning experience and proof of concept.

<b>Reddit </b>: One of the more popular forum sites, on which a wide range of topics is discussed.

<b>Youtube </b>: A very popular video-sharing site with a range of product reviews, whose comments are well suited to scraping.


While writing the report, websites more closely related to the product were also scraped and considered. It should be noted that if the entire process is to be automated, the data from these websites should be excluded.

<b>Hardware Forum zone </b>: A tech forum used by Singaporeans. It has a page dedicated to discussing the iPhone and was hence selected for analysis.

<b> Apple Insider </b>: A site that publishes articles on Apple products. It was also scraped initially; however, it was cut from the final version as it was too niche for the example of the iPhone chosen and its user base was likely to be biased as well.

### 4.2 Scraping the websites

#### 4.2.1 Amazon

Using the amazon_search_results_scraper package, we are able to find the product pages and then use selenium to collect the comments under the product pages found.

The function <b>search_amazon(keyword, exceptions) </b> takes the product ('iphone') as the keyword and finds the product pages. It then uses selenium to go to the first page of reviews for each product and scrape the comments from it. It should be noted that an exceptions variable was intended to optimise the search; however, the list of exceptions was not generated automatically and had to be entered by a user. The function returns a dataframe of the scraped comments.


The function <b> save_data(data, file_name) </b> then takes the dataframe returned from <b> search_amazon(keyword, exceptions) </b> and stores it in the <b> Data </b> folder under the file name <b>amazon_reviews.csv </b>.

In [None]:
from amazon_search_results_scraper import *
import pandas as pd
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import random
from selenium.webdriver.common.by import By


def checkvalid_us(main_keyword, i):
    try:
        title = i['link']
        if main_keyword in title:
            return True
        else:
            return False
    except:
        return False
    
def create_link_to_crpage_us(link):
    new_link = link.replace("dp", "product-reviews")
    new_link, chop, chop_liver = new_link.partition("ref=")
    new_link = new_link + \
        'ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=1&sortBy=recent'
    return new_link

def amazon_create_different_page_links(link, number_of_pages_iteration=5):
    links = []
    links.append(link)
    for i in range(2, (number_of_pages_iteration + 1)):
        linked = links[-1]
        new_link = linked.replace(
            'pageNumber=' + (str(i-1)), 'pageNumber=' + (str(i)))
        links.append(new_link)
    return links
    
def search_amazon(keyword, exceptions):
    amazon.open("https://www.amazon.com/")
    main_keyword = keyword
    # NOTE: title_exceptions is stored here but is not applied anywhere below
    title_exceptions = exceptions
    amazon.search(keyword=main_keyword)

    response = amazon.search_results()
    search_results = response['body']

    pages = []
    for i in search_results:
        if checkvalid_us(main_keyword, i):
            pages.append(i['link'])
        else:
            pass
            # print('no valid link or no valid_title')

    # print(create_link_to_crpage_us(pages[1]))

    custom_review_pages = []
    for i in pages:
        to_add = create_link_to_crpage_us(i)
        custom_review_pages.append(to_add)
    
    #links stored in custom_review_pages
    df = pd.DataFrame(columns=["Title" , "Link", "Stars", "Comments"])
    list_links = []
    for link in custom_review_pages:
        options = webdriver.ChromeOptions()
        options.add_argument(
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')
        driver = webdriver.Chrome(options=options)
        driver.get(link)
        htmlsource = driver.page_source
        soup = BeautifulSoup(htmlsource, 'html.parser')
        driver.quit()
        time.sleep(2)
        try:
            title = soup.find('h1')
            title = (title.findChildren('span')[0].text.strip())
            review_section = soup.find('div', {'id': "cm_cr-review_list"})
            # find all reviews
            reviews = review_section.find_all('div', {'class': "review"})
            # for each review
            for i in reviews:
                # collect stars
                stars = i.find('i', {'class': 'review-rating'})
                stared = (stars.findChildren('span')[0]).text.strip()
                # collect review
                review = i.find(
                    'a', {'class': 'a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold'})
                review = (review.findChildren(
                    'span', recursive=True)[0]).text.strip()
                df.loc[len(df)]  = {"Title": title, "Link": link, "Stars": str(stared), "Comments":str(review)}  
        except:
            print(soup)
        time.sleep(10)
    return df
            

In [None]:
df_amazon_reviews =search_amazon('iphone', ['Case', 'case'])
save_data(df_amazon_reviews, "amazon_reviews")
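
The page links built by `amazon_create_different_page_links` rely on replacing one `pageNumber=` substring with the next. An alternative is to parse the URL and set the query parameter directly. The following is a minimal sketch using only the standard library; the helper names and the example URL are hypothetical, and the sketch makes no claim about the query parameters a real review page expects.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page_number(url, page):
    """Return url with its pageNumber query parameter set to page."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["pageNumber"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))

def review_page_links(first_page_url, pages=5):
    """Build links for pages 1..pages of a review listing."""
    return [with_page_number(first_page_url, n) for n in range(1, pages + 1)]

# Hypothetical URL, used only to show the transformation
links = review_page_links(
    "https://example.com/product-reviews/ABC123?sortBy=recent&pageNumber=1", pages=3)
print(links[-1])
```

Because the parameter is set rather than searched for, this also works when the starting link has no `pageNumber` parameter at all.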

#### 4.2.2 Reddit

Using PRAW, the Python Reddit API Wrapper, one can collect comments about a certain topic through Reddit's API.

The function <b> combined_reddit(search_term: str, post_limit=10) </b> takes the product ('iphone') as the search_term, treats it as a subreddit name and collects the comments from that subreddit's current hot posts (10 by default). The comments are then split into shorter fragments and cleaned, and the result is returned as a dataframe.

The function <b> save_data(data, file_name) </b> then takes the dataframe returned from <b> combined_reddit(search_term: str, post_limit=10) </b> and stores it in the <b> Data </b> folder under the file name <b>reddit_scrapped.csv </b>.

In [None]:
import praw
import re
import pandas as pd
from cleantext import clean

def scrape_reddit(search_term:str, post_limit=10):
    df = pd.DataFrame(columns=['Posts', 'Comments'])
    reddit = praw.Reddit(client_id='Insert your own client id here',
                         client_secret='Insert your own client secret here',
                         user_agent='<console:HAPPY:1.0')
    subreddit = reddit.subreddit(search_term)
    for submission in subreddit.hot(limit=post_limit):
        for comment in submission.comments:
            if hasattr(comment,'body'):
                index = df.shape[0]
                df.loc[index] = [submission.title, comment.body]
    return df

def cleanup_reddit(uncleaned_frame):
    df = pd.DataFrame(columns=['Posts', 'Comments'])
    uncleaned_frame.reset_index()
    for index, row in uncleaned_frame.iterrows():
        Post, Comment = row['Posts'], row['Comments']
        if Comment == "[deleted]":
            continue
        subComments = []
        commented = Comment.split(". ")
        for x in commented:
            subComments.append(x)
        for i in subComments:
            commentina = re.split("[?:!]", i)
            for j in commentina:
                j = clean(j, no_emoji=True)
                if j == "":
                    continue
                index = df.shape[0]
                df.loc[index] = [Post, j]
    return df


def combined_reddit(search_term: str, post_limit=10):
    df = scrape_reddit(search_term,post_limit)
    output = cleanup_reddit(df)
    return output
        
dataframe_reddit = combined_reddit("iphone")
save_data(dataframe_reddit, "reddit_scrapped")        
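
The splitting step in `cleanup_reddit` can be illustrated on its own. The following is a minimal sketch that uses a single regular expression and leaves out the `clean` call from the cleantext library, so it is a simplification rather than a drop-in replacement. The helper name `split_comment` is hypothetical, and treating "[removed]" like "[deleted]" is an added assumption.

```python
import re

def split_comment(comment):
    """Split a comment into fragments on '. ', '?', ':' and '!'."""
    # Skip placeholder bodies left behind by deleted or removed comments
    if comment.strip() in ("[deleted]", "[removed]"):
        return []
    # A period followed by whitespace, or any of ? : !, ends a fragment
    fragments = re.split(r"\.\s+|[?:!]", comment)
    return [fragment.strip() for fragment in fragments if fragment.strip()]

print(split_comment("Battery is great! Camera could be better. Worth it?"))
# ['Battery is great', 'Camera could be better', 'Worth it']
```

Shorter fragments tend to cover a single topic each, which is what the later classification steps label.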

#### 4.2.3 Youtube

Using YouTube's Data API, we are able to find the top videos for a certain search and collect the comments from them.

The function <b> youtube_search(search_terms, 5) </b> takes the product ('iphone') and the keyword 'review' as the search terms, finds the top 5 videos for that search and returns a list of their video IDs.

The function <b> youtube_comments(vid_id) </b> then takes the list of video IDs, collects the comments from the videos and puts them into a dataframe.

The function <b> save_data(data, file_name) </b> then takes the dataframe returned from <b>youtube_comments(vid_id) </b> and stores it in the <b> Data </b> folder under the file name <b>youtube comments.csv </b>.

In [None]:
import googleapiclient.discovery
from youtube_transcript_api import YouTubeTranscriptApi
import pandas as pd
from deepmultilingualpunctuation import PunctuationModel
from cleantext import clean

api_service_name = "youtube"
api_version = "v3"
DEVELOPER_KEY = "Insert your own key here"

youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=DEVELOPER_KEY)


def youtube_search(search_term, maxresults):  # returns video ID
    # assuming that maxresults is always <=50
    vidID = []

    # print("Searching for videos")

    request = youtube.search().list(
        q=" ".join(search_term),
        part="id",
        type="video",
        maxResults=maxresults)

    search_response = request.execute()

    # Iterate over the items actually returned, which may be fewer than maxresults
    for item in search_response['items']:
        videoID = item['id']["videoId"]
        # videoLinks = "https://www.youtube.com/watch?v=" + videoID
        vidID.append(videoID)

    return vidID


# using video ID, get all the comments and likes to put into dataframe
def youtube_comments(vidID):
    df = pd.DataFrame(columns=['title', 'Comments', 'likes'])
    for i in range(len(vidID)):

        title, video_response = get_title(vidID[i])

        try:  # use try/except to check if comments exist
            comment_count = video_response['items'][0]['statistics']['commentCount']
            # print("Video-", title, "-- Comment count: ", comment_count)

            request_comment = youtube.commentThreads().list(
                part="snippet,replies",
                videoId=vidID[i])

            comment_response = request_comment.execute()
            df = get_all_comments(comment_response, title, df)
            next_page_ = comment_response.get('nextPageToken')

            while next_page_ is not None:  # load next page of comments
                request = youtube.commentThreads().list(  # new request for next page
                    part="snippet,replies",
                    pageToken=next_page_,
                    videoId=vidID[i]
                )
                comment_response = request.execute()

                df = get_all_comments(comment_response, title, df)
                next_page_ = comment_response.get('nextPageToken')

        except Exception:
            # a missing commentCount or a failed request means comments are unavailable
            print("Video", i + 1, "-", title,
                  "-- Comments are turned off, ignoring video")
    return df


def get_title(vid_id):
    request = youtube.videos().list(
        part="snippet,statistics",
        id=vid_id)

    video_response = request.execute()
    title = video_response['items'][0]['snippet']['title']
    return title, video_response


def get_all_comments(response, title, df):
    for comment in response['items']:
        top = comment['snippet']['topLevelComment']['snippet']
        comment_text = clean(top['textDisplay'], no_emoji=True)
        likes_count = top['likeCount']
        # print(comment_text, likes_count)
        if 'replies' in comment:
            for reply in comment['replies']['comments']:
                rtext = clean(reply['snippet']['textDisplay'], no_emoji=True)
                rlike = reply['snippet']['likeCount']
                # DataFrame.append was removed in pandas 2.0, so rows are added with .loc
                df.loc[len(df)] = {"title": title, "Comments": rtext,
                                   "likes": rlike}

        df.loc[len(df)] = {"title": title, "Comments": comment_text,
                           "likes": likes_count}
    return df


search_terms = [product] + ["review"]
vid_id = youtube_search(search_terms, 5)
df_comments = youtube_comments(vid_id)
save_data(df_comments, "youtube comments")
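
The paging logic in `youtube_comments` can be read as a generic loop that follows `nextPageToken` until it is absent. The following is a minimal sketch under stated assumptions: `fetch_page` is a hypothetical callable standing in for `youtube.commentThreads().list(...).execute()`, and the two fake pages are made up so that the loop can run without an API key.

```python
def iter_pages(fetch_page):
    """Yield every response page, following nextPageToken until it is absent."""
    token = None
    while True:
        response = fetch_page(token)
        yield response
        token = response.get("nextPageToken")
        if token is None:
            break


# Hypothetical stand-in for the API: two pages of made-up items.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c"]},
}


def fake_fetch(token):
    return FAKE_PAGES[token]


# Flatten the items of all pages into one list.
items = [item for page in iter_pages(fake_fetch) for item in page["items"]]
print(items)
```

Keeping the paging separate from the parsing means the same loop could serve any endpoint that returns a `nextPageToken`.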


#### 4.2.4 Hardware Forum Zone

Using Selenium, we can then scrape HardwareZone for comments about the product. This function requires the specific link to the iPhone chat room to be provided. As this forum is a niche pick, it would not be scraped in a fully automated process anyway, so the link was provided directly to expedite the process.

The following function, <b> df = search_hardware_zone_forum(main_keyword, number_of_pages) </b>, takes the product as the main keyword, to ensure each thread is about the product, and scrapes as many forum listing pages as number_of_pages specifies. It then returns the scraped comments in a dataframe.

The function <b> save_data(data, file_name) </b> then takes the dataframe returned by <b>search_hardware_zone_forum(main_keyword, number_of_pages) </b> and stores it in the <b> Data </b> folder as <b>hardware_zone.csv </b>.

In [None]:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd
from cleantext import clean

def generate_links_hardware_zone(basic_link, count):
    # forum pages are numbered from 1, matching the 'page-1' links built further down
    listed = []
    for i in range(1, count + 1):
        listed.append(basic_link + 'page-' + str(i))
    return listed

def search_hardware_zone_forum(main_keyword, number_of_pages):
    basiclink = 'https://forums.hardwarezone.com.sg/forums/the-iphone-chat-room.240/'
    links = generate_links_hardware_zone(basiclink,number_of_pages)
    options = webdriver.ChromeOptions()
    options.add_argument(
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    list_links = []
    for link in links:
        # open a browser for each listing page and close it once the HTML is captured
        driver = webdriver.Chrome(options=options)
        driver.get(link)
        htmlsource = driver.page_source
        driver.quit()
        time.sleep(5)
        soup = BeautifulSoup(htmlsource, 'html.parser')
        posts = soup.find_all('div', {'class': 'structItem--thread'})
        for post in posts:
            posted = post.find('div', {'class': 'structItem-title'})
            info = post.find('div', {'class': 'structItem-cell--meta'})
            linkedin = posted.findChildren("a", recursive=True)
            try:
                link_to_thread = (linkedin[0]['href'])
                numberofposts = info.findChildren(
                    "a", recursive=True, href=True)[0].text.strip()
                link_to_thread = "https://forums.hardwarezone.com.sg" + link_to_thread
                if main_keyword in link_to_thread:
                    list_links.append([link_to_thread, numberofposts])
            except:
                print('no link found or cannot find number of replies')
    
    # created once, before the loop, so pages from every thread are kept
    CommentPageList = []
    for listed_link in list_links:
        link = listed_link[0]
        options = webdriver.ChromeOptions()
        options.add_argument(
            'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        driver = webdriver.Chrome(options=options)
        driver.get(link)
        time.sleep(5)  # give the page time to load before reading its source
        htmlsource = driver.page_source
        driver.quit()
        time.sleep(10)
        soup = BeautifulSoup(htmlsource, 'html.parser')
        pageNav = soup.find_all('li', {'class': 'pageNav-page'})
        try:
            lastpage = pageNav[-1]
            count = int(lastpage.findChildren(
                'a', recursive=True)[0].text.strip())
            pages = 5
            # if managed to find the number of pages per forum sheet,
            # then add the number of pages into the list starting from the back
            while count > 0 and pages > 0:
                new_link = link + 'page-' + str(count)
                CommentPageList.append([link, count, new_link])
                pages -= 1
                count -= 1

        except:
            # ok so only one page
            print('Could not find pageNav')
            new_link = link + 'page-' + str(1)
            CommentPageList.append([link, 1, new_link])
    
    df = pd.DataFrame(columns=["Forum Link" , "Page count", "Comments"])
    for _, Page_number, Link in CommentPageList:
        options = webdriver.ChromeOptions()
        options.add_argument(
            'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        driver = webdriver.Chrome(options=options)
        driver.get(Link)
        time.sleep(3)
        htmlsource = driver.page_source
        driver.quit()
        time.sleep(5)
        soup = BeautifulSoup(htmlsource, 'html.parser')
        responses = soup.find_all('div', {'class': 'bbWrapper'})
        for response in responses:
            df.loc[len(df)]  = {"Forum Link": clean(Link, no_emoji=True), "Page count": Page_number, "Comments": response.text.strip()}  
    return df

# number_of_pages is a required argument; 5 is an assumed, illustrative value
df = search_hardware_zone_forum('iphone', 5)
save_data(df, "hardware_zone")


##### 4.2.4.1 Cleaning the output from hardware zone

The output from the HardwareZone forum requires some cleaning because of the format it is received in. When someone on the forum replies to another comment, the quoted comment appears at the start of the reply as well. The primary goal is therefore to remove any quoted references to other comments.

The function <b> clean_hardware_zone(path) </b> takes the path to the csv file with the comments from HardwareZone and, for every comment, removes all text up to and including "Click to expand..." before storing the remainder in a dataframe.

The function <b> save_data(data, file_name) </b> then takes the dataframe returned by <b>clean_hardware_zone(path) </b> and stores it in the <b> Data </b> folder as <b>hardware_zone.csv </b>. This overwrites the previous csv.
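
The slicing step can be sketched on a single string. This is an illustration only: `strip_quoted_text` is a hypothetical helper, the sample comments are made up, and it assumes the marker text is exactly "Click to expand...". Unlike `clean_hardware_zone`, which cuts at the first marker, this variant cuts at the last one, so that nested quotes are dropped as well.

```python
MARKER = "Click to expand..."


def strip_quoted_text(comment):
    """Drop everything up to and including the last quote marker."""
    # rpartition returns ('', '', comment) when the marker is absent,
    # so unquoted comments pass through unchanged apart from stripping.
    _, _, reply = comment.rpartition(MARKER)
    return reply.strip()


quoted = "alice said: nice phone Click to expand... I agree, the camera is great"
plain = "  battery lasts all day  "
print(strip_quoted_text(quoted))
print(strip_quoted_text(plain))
```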

In [None]:
import csv

def read_csv(file_path):
    # return every row except the header; the file is closed automatically
    with open(file_path, encoding="utf8") as file:
        return list(csv.reader(file))[1:]

def clean_hardware_zone(path):
    hard_ware = read_csv(path)
    df = pd.DataFrame(columns=["Comments"])
    for row in hard_ware:
        sentence = row[-1]
        if "Click to expand..." in sentence:
            slicing_index  = sentence.index("Click to expand...") + 18
        else:
            slicing_index = 0
        sentence = (sentence[slicing_index:].strip())
        df.loc[len(df)]  = {"Comments": sentence}  

    return df

df = clean_hardware_zone("Data/hardware_zone.csv")
save_data(df, "hardware_zone")

### 4.3 Output

From scraping all these websites, the following csvs were saved into <b> Data </b>:

<b> amazon_reviews.csv </b> : Each row contains the title of the item, the link to it, and the stars and review text of each review.

<b> reddit_scrapped.csv </b> : Each row contains the post that the comment comes from, as well as the comment itself.

<b> youtube comments.csv </b> : Each row contains the title of the video the comment comes from, the comment itself and the likes the comment received.

<b> hardware_zone.csv </b> : Contains the comment from each post in the forum.

~~<b> apple_insider_scraped.csv </b> : Initial scrape of Apple Insider. Will not be used subsequently.~~

The data is currently very messy and uses different headers for storing the same kind of information. The csvs will be cleaned and stored in <b> First_pass </b> after processing and standardisation.
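
As an illustration of the standardisation step, the sketch below maps a source-specific comment column onto the shared `Comments` header. It is a sketch under assumptions: `standardise_comments` is a hypothetical helper, and the `review` column name and sample rows are made up for the example rather than taken from the actual csvs.

```python
import pandas as pd


def standardise_comments(df, comment_column):
    """Return a one-column dataframe named 'Comments' with empty rows removed."""
    comments = df[comment_column].dropna().astype(str).str.strip()
    comments = comments[comments != ""]
    return pd.DataFrame({"Comments": comments.tolist()})


# Made-up sample standing in for one scraped csv.
amazon_sample = pd.DataFrame({
    "Title": ["phone A", "phone B", "phone C"],
    "review": ["  Great phone ", None, ""],
})
standardised = standardise_comments(amazon_sample, "review")
print(standardised)
```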

## 5. Data processing

### 5.0 Inputs

In this step, we process all the information collected in the previous step, so we take the following inputs from the previous steps:

<b> product </b>: A string of the product used for analysis. This was generated before step 3.

<b> competitors </b>: A dictionary of competing products and their associated company. This was generated by openai in step 3.

<b> CSVs of comments </b>: Various csvs containing the comments scraped from the different websites. These are found in the root directory of the <b> Data  </b> folder.

In [11]:
product = "iphone"

competitors = {
    "Samsung Galaxy S20": "Samsung",
    "Google Pixel 4": "Google",
    "OnePlus 8 Pro": "OnePlus",
    "Huawei P40 Pro": "Huawei"
}

### 5.1 First filter to remove unrelated comments

After collecting all the data, there is a high chance that many comments may not be particularly helpful or even related to the product. Therefore we need to clean the data before we can categorise the comments properly.

To classify the comments, zero-shot classification with a pretrained model was used. It takes a set of candidate labels and scores how well each comment matches each label.
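
To make the classifier output concrete, the sketch below works on a result in the shape returned by the Hugging Face zero-shot pipeline: a dictionary with `sequence`, `labels` and `scores`, with labels ordered from most to least likely. The values in `example_result` are invented for illustration, and `top_label` is a hypothetical helper, not part of the project code.

```python
def top_label(result, threshold=0.3):
    """Return (label, score) for the best label, or None if it is not confident enough."""
    label, score = result["labels"][0], result["scores"][0]
    return (label, score) if score > threshold else None


# Invented result, shaped like the output of the zero-shot pipeline.
example_result = {
    "sequence": "the camera on this iphone is great",
    "labels": ["Comment about iphone", "video review", "offtopic"],
    "scores": [0.62, 0.25, 0.13],
}
print(top_label(example_result))
```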

#### 5.1.1 Generating labels for zero-shot classification

To create the labels for data cleaning, there need to be categories to sort the unrelated comments into. first_filter_keywords is a set of predetermined generic categories meant to catch unrelated comments. To these we add a label for the <b> product </b> as well as one for each of its competing products, using the <b> competitors </b> dictionary made using openAI in the earlier steps.

The function <b>get_first_filter(product, competitors, filters) </b> takes the product, competitors and the filters and combines them to return a list of the labels that can be used for the first round of zero shot classification.

In [None]:
#Get the full list of fake categories and actual categories for the product
import os
import csv
import pandas as pd

def read_csv(file_path):
    return pd.read_csv(file_path)
    # file = open(file_path, encoding="utf8")
    # cleaned = list(csv.reader(file))[1:]
    # return cleaned
    

first_filter_keywords = [
        'first',
        'game',
        'app',
        'Thanks',
        'great video',
        'video quality',
        'links href a',
        'video review',
        'subscribed',
        'offtopic',
    ]

def get_first_filter(product, competitors, filters):
    # named labels rather than filter to avoid shadowing the built-in
    labels = ["Comment about " + product]
    for competitor in competitors:
        labels.append("Comment about " + competitor)
    return labels + filters



first_filter = get_first_filter(product, competitors, first_filter_keywords)
first_filter

#After classifying 4400 comments, 1412 scored more than 0.3 and only 464 were related to the product



This is the first_filter generated for the case of the iPhone

In [8]:
first_filter = ['Comment about iphone',
 'Comment about Samsung Galaxy S20',
 'Comment about Google Pixel 4',
 'Comment about OnePlus 8 Pro',
 'Comment about Huawei P40 Pro',
 'first',
 'game',
 'app',
 'Thanks',
 'great video',
 'video quality',
 'links href a',
 'video review',
 'subscribed',
 'offtopic']

#### 5.1.2 Zero shot classification

Now that we have our labels we can run the zero shot classifier on them. 

<b> The following code was run on Google Colab </b>

This allows the progress to be periodically saved and the code to run in the background while other work is done on the computer.

A <b>CSVsForAid </b> folder on the drive holds the csvs to be processed, and an <b> Output </b> folder inside it receives the output of the code.

The <b> save_to_drive(df,filename):</b> function saves the current state of the dataframe to the drive, so that progress is logged frequently in case Colab crashes.

The <b> read_from_drive(file_path, specific_column): </b> function loads raw csvs which have not been worked on before.

The <b> check_progress_csv(file_path):</b> function loads csvs with partial progress which are already stored in the Output folder on the drive.

The <b> df_labeller_by_20s(filename, factors, specific_column): </b> function labels all the data in the csv. It takes the filename, the first_filter as the factors, and the specific_column that holds the comments.

It outputs a csv with the columns <b> Comments: </b> the sentence, <b> label:  </b> the category the sentence falls in, and <b> score: </b> how confident the model is in its prediction for that sentence.

The function first checks whether progress has already been made by looking in the <b> Output </b> folder with <b> check_progress_csv(file_path):</b>. If no progress has been made, it uses <b> read_from_drive(file_path, specific_column): </b> to get the raw csv and initialises the dataframe with the columns described in the previous paragraph. Anything that has not been labelled has its label and score set to 'Not Done', and an invalid sentence has its label and score set to 'Not Valid Sentence'.

It then runs the classifier on each line and saves the csv to the drive each time it has made 20 predictions with at least 30% certainty.

In [None]:
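from google.colab import drive
from transformers import pipeline
import pandas as pd
from datetime import datetime

#Mount Google Drive so that the CSVsForAid folder is reachable under /content/drive
drive.mount('/content/drive')

classifier = pipeline('zero-shot-classification')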


def save_to_drive(df,filename):
  path = '/content/drive/My Drive/CSVsForAid/Output/' + filename + '_output.csv'
  with open(path, 'w', encoding = 'utf-8-sig') as f:
    df.to_csv(f)

def read_from_drive(file_path, specific_column):
  path = '/content/drive/My Drive/CSVsForAid/' + file_path + '.csv'
  df = pd.read_csv(path)
  answer = df[[specific_column]]
  return answer

def check_progress_csv(file_path):
  path = '/content/drive/My Drive/CSVsForAid/Output/' + file_path + '_output.csv'
  try:
    df = pd.read_csv(path)
    return df
  except FileNotFoundError:
    #No progress file yet; the caller checks whether a DataFrame was returned
    return None
  
def df_labeller_by_20s(filename, factors, specific_column):
  #Label every 20 sets and then save to the csv
  #If df does not exist in proper form first then create
  #Find the column to start
  
  #Open the csv from drive
  trail = check_progress_csv(filename)
  if isinstance(trail, pd.DataFrame):
    print("Progress csv found, starting from existing index")
    df_input = check_progress_csv(filename)[[specific_column, 'label', 'score']]
  else:
    print("No progress csv found, reading from folder")
    df_input = read_from_drive(filename, specific_column)
    #Initialise the dataframe and add the columns if not available
    columns = df_input.columns
    if 'label' not in columns:
      df_input['label'] = 'Not Done'
      df_input['score'] = 'Not Done'

  #iterate through the whole df and save after every 20 confident predictions
  #finding the first instance of Not Done (or the end, if everything is labelled)
  pending = df_input[df_input.label == "Not Done"].index
  index = pending[0] if len(pending) > 0 else df_input.shape[0]
  ending_index = df_input.shape[0]
  count = 0

  #Proceed to iterate through
  while index < ending_index:
    if count == 20:
      save_to_drive(df_input, filename)
      count = 0
      now = datetime.now()
      dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
      print("saved " + str(index) + " sentence at " + dt_string)

    try:
      sentence = df_input.at[index, specific_column]
      result = classifier(sentence, factors)
      label = result['labels'][0]
      score = result['scores'][0]
      #.at writes into the dataframe itself rather than into a temporary copy
      df_input.at[index, 'label'] = label
      df_input.at[index, 'score'] = score
      if score >= 0.3:
        count += 1
    except Exception:
      df_input.at[index, 'label'] = 'Not Valid Sentence'
      df_input.at[index, 'score'] = 'Not Valid Sentence'
      print('skipping' + str(index))
    #always move on to the next row, otherwise a low score would loop forever
    index += 1
  #Leave here and save one last time to confirm
  #function to save
  save_to_drive(df_input, filename)
  return df_input


answer = df_labeller_by_20s("youtube_comments", first_filter, 'Comments')
answer


Once the processing was done on the selected csvs, they were downloaded and placed back into this folder under <b> First_pass </b>
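
The resume step of the labeller, finding where an interrupted run should continue, can be sketched in isolation. This is a minimal sketch: `first_unlabelled` is a hypothetical helper and the small `progress` dataframe is made up, but it follows the 'Not Done' convention described above.

```python
import pandas as pd


def first_unlabelled(df):
    """Return the index of the first 'Not Done' row, or None when all rows are labelled."""
    pending = df.index[df["label"] == "Not Done"]
    return pending[0] if len(pending) > 0 else None


# Made-up progress file: one row labelled, two still waiting.
progress = pd.DataFrame({
    "Comments": ["love the camera", "first", "battery drains fast"],
    "label": ["camera", "Not Done", "Not Done"],
    "score": [0.81, "Not Done", "Not Done"],
})
start = first_unlabelled(progress)
print(start)
```

Returning None for a finished file lets the caller skip the classification loop instead of failing on an empty index.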

### 5.2 Recombining all the data together

The data was then scanned, and the sentences that go through to the second round of classification were selected. Only comments related to the product whose classification confidence is more than 30% are selected for evaluation.

The function <b> combine_csvs(folder_location, product) </b> iterates through all csvs in folder_location and uses the product name to reconstruct the label used in the first_filter. It then returns a dataframe of the comments that carry that label with more than 30% certainty.

The function <b> save_data(data, file_name) </b> then takes the dataframe returned by <b> combine_csvs(folder_location, product) </b> and stores it in the <b> Data/Second_pass </b> folder as <b>check.csv </b>.

In [None]:
#In between the filtering: keep the comments labelled as being about the product and drop the others (only kept if more than 0.3 certain)
#All scrapers were standardised to have Comments, label, score columns
import pandas as pd
import os
def read_csv(file_path):
    return pd.read_csv(file_path)

def combine_csvs(folder_location, product):
    product_stuff = pd.DataFrame(columns=['Comments'])
    product_filter = "Comment about " + product
    certainty = 0.3
    for filename in os.listdir(folder_location):
        print(filename)
        if filename.endswith('.csv'):
            filepath = os.path.join(folder_location, filename)
            dataframe = read_csv(filepath) 
            for index, row in dataframe.iterrows():
                comment = str(row['Comments']).strip()
                label = row['label']
                score = row['score']
                if score ==  "Not Valid Sentence":
                    continue
                if score == "Not Done":
                    break
                if float(score) > certainty:
                    if label == product_filter:
                        product_stuff.loc[len(product_stuff)]  = {"Comments": comment}  
    return product_stuff
    
                       
this = combine_csvs('Data/First_pass', 'iphone')
save_data(this, "Second_pass/check")


The data from this function is saved in check.csv under Second_pass in Data.
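
The same selection rule can also be written as a vectorised pandas filter. The sketch below is an alternative illustration, not the method used above: `select_product_comments` is a hypothetical helper, the sample rows are made up, and unlike `combine_csvs` it skips 'Not Done' rows instead of stopping at the first one.

```python
import pandas as pd


def select_product_comments(df, product, certainty=0.3):
    """Keep comments labelled as being about the product with score above certainty."""
    # Markers such as 'Not Done' or 'Not Valid Sentence' become NaN and never pass.
    scores = pd.to_numeric(df["score"], errors="coerce")
    keep = (df["label"] == "Comment about " + product) & (scores > certainty)
    return df.loc[keep, ["Comments"]].reset_index(drop=True)


# Made-up first-pass output.
first_pass = pd.DataFrame({
    "Comments": ["great battery", "first", "meh", "??"],
    "label": ["Comment about iphone", "first",
              "Comment about iphone", "Not Valid Sentence"],
    "score": ["0.8", "0.9", "0.2", "Not Valid Sentence"],
})
selected = select_product_comments(first_pass, "iphone")
print(selected)
```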

### 5.3 Classification back into categories identified in part 2 

Now that the data has been cleaned, it can be put into the categories based on the <b> keywords </b> identified earlier.

#### 5.3.1 Labels used for round 2

We use the <b> keywords </b> found in step 3 as the categories to sort the data into. As there could still be unrelated comments, a second set of generic filters was created to filter out unwanted comments.

The function <b> get_categorization_filter(keywords, filters) </b> takes the keywords found in step 3 as well as a predefined list of strings used for the second round of filtering. It then returns the list of categories that will be used for the second round of zero-shot classification.

In [None]:
def get_categorization_filter(keywords, filters):
    final_filter = keywords + filters
    return final_filter

#second_filter_keywords catches comments that are still unrelated after the first pass
#keywords come from create_keywords in 2c

second_filter_keywords = [
    'generic',
    'popularity',
    'links',
    'offtopic',
    'suggestion',
    'advice'
]


categories = get_categorization_filter(keywords, second_filter_keywords)

This is the output for the iphone

In [9]:
categories = ['heavy',
 'price',
 'camera',
 'durable',
 'durability',
 'lightweight',
 'easy to use',
 'craftsmanship',
 'easy to set up',
 'charging speed',
 'charges quickly',
 'water resistance',
 'wireless charging',
 'long battery life',
 'comfortable to use',
 'temperature control',
 'generic',
 'popularity',
 'links',
 'offtopic',
 'suggestion',
 'advice']

#### 5.3.2 Zero shot classification round 2

The code remains the same as for the first zero-shot classification; only the factors passed in change, from <b> first_filter </b> to <b> categories </b>.

In [None]:
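from google.colab import drive
from transformers import pipeline
import pandas as pd
from datetime import datetime

#Mount Google Drive so that the CSVsForAid folder is reachable under /content/drive
drive.mount('/content/drive')

classifier = pipeline('zero-shot-classification')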


def save_to_drive(df,filename):
  path = '/content/drive/My Drive/CSVsForAid/Output/' + filename + '_output.csv'
  with open(path, 'w', encoding = 'utf-8-sig') as f:
    df.to_csv(f)

def read_from_drive(file_path, specific_column):
  path = '/content/drive/My Drive/CSVsForAid/' + file_path + '.csv'
  df = pd.read_csv(path)
  answer = df[[specific_column]]
  return answer

def check_progress_csv(file_path):
  path = '/content/drive/My Drive/CSVsForAid/Output/' + file_path + '_output.csv'
  try:
    df = pd.read_csv(path)
    return df
  except FileNotFoundError:
    #No progress file yet; the caller checks whether a DataFrame was returned
    return None
  
def df_labeller_by_20s(filename, factors, specific_column):
  #Label every 20 sets and then save to the csv
  #If df does not exist in proper form first then create
  #Find the column to start
  
  #Open the csv from drive
  trail = check_progress_csv(filename)
  if isinstance(trail, pd.DataFrame):
    print("Progress csv found, starting from existing index")
    df_input = check_progress_csv(filename)[[specific_column, 'label', 'score']]
  else:
    print("No progress csv found, reading from folder")
    df_input = read_from_drive(filename, specific_column)
    #Initialise the dataframe and add the columns if not available
    columns = df_input.columns
    if 'label' not in columns:
      df_input['label'] = 'Not Done'
      df_input['score'] = 'Not Done'

  #iterate through the whole df and save after every 20 confident predictions
  #finding the first instance of Not Done (or the end, if everything is labelled)
  pending = df_input[df_input.label == "Not Done"].index
  index = pending[0] if len(pending) > 0 else df_input.shape[0]
  ending_index = df_input.shape[0]
  count = 0

  #Proceed to iterate through
  while index < ending_index:
    if count == 20:
      save_to_drive(df_input, filename)
      count = 0
      now = datetime.now()
      dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
      print("saved " + str(index) + " sentence at " + dt_string)

    try:
      sentence = df_input.at[index, specific_column]
      result = classifier(sentence, factors)
      label = result['labels'][0]
      score = result['scores'][0]
      #.at writes into the dataframe itself rather than into a temporary copy
      df_input.at[index, 'label'] = label
      df_input.at[index, 'score'] = score
      if score >= 0.3:
        count += 1
    except Exception:
      df_input.at[index, 'label'] = 'Not Valid Sentence'
      df_input.at[index, 'score'] = 'Not Valid Sentence'
      print('skipping' + str(index))
    #always move on to the next row, otherwise a low score would loop forever
    index += 1
  #Leave here and save one last time to confirm
  #function to save
  save_to_drive(df_input, filename)
  return df_input


#check.csv from the second pass is the input, so the output is saved as check_output.csv
answer = df_labeller_by_20s("check", categories, 'Comments')
answer


The output for this function is then downloaded and stored inside <b> Data/Final </b> as <b> check_output.csv </b>.

### 5.4 Output

After processing all the data, the comments that will be used to derive our design opportunities for the product have finally been separated out.

<b> check_output.csv </b>: This is the csv holding all the data. It has the comment, the label for the category it belongs to, and the confidence score. It is located in the <b> Data/Final</b> folder.

## 6. Identification of areas of opportunity

### 6.0 Input

In this step, we take all the information collected in the previous steps and derive the outputs.

<b> product </b>: A string of the product used for analysis. This was generated before step 3.

<b> keywords </b>: A list of keywords generated as the output of step 3. These are used as the categories to filter the comments into.

<b> competitors </b>: A dictionary of competing products and their associated company. This was generated by openai in step 3.

<b> CSVs of final comments </b>: This is the csv of the comments that we will be analysing. It can be found in the <b> Data/Final  </b> folder.

In [12]:
product = 'iphone'

keywords = ['heavy', 'price', 'camera', 'durable', 'durability', 'lightweight', 'easy to use', 'craftsmanship', 'easy to set up', 'charging speed', 'charges quickly', 'water resistance', 'wireless charging', 'long battery life', 'comfortable to use', 'temperature control']

competitors = {
    "Samsung Galaxy S20": "Samsung",
    "Google Pixel 4": "Google",
    "OnePlus 8 Pro": "OnePlus",
    "Huawei P40 Pro": "Huawei"
}

#### 6.0.1 Intended Outputs

There are several outputs that will be delivered:

<b> 1. The top three categories that future iterations of the product should focus on:</b> This is determined by a quantitative analysis of the data. The most important categories are those with the highest absolute score, as opinion on them is overwhelmingly positive or negative. The top 3 categories are returned as the ones to focus on.

<b> 2. The ways to improve the top three categories in future iterations of the product:</b> 
This is done using a qualitative analysis. All the comments made about the top categories are grouped together and summarised to find the best way to improve each category.

<b> 3.  A design problem statement to start designers ideating on how to improve the product:</b> 
Using the categories in output 1, we can use openai to generate a design prompt for designers to ideate on.
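
The ranking behind output 1 can be sketched as a sort on absolute score. `top_categories` is a hypothetical helper, and the scores are a subset of the iPhone scores reported in section 6.1, used here only as sample input.

```python
def top_categories(scores, n=3):
    """Return the n (category, score) pairs with the largest absolute score."""
    ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


# Subset of the per-category scores reported for the iPhone.
sample_scores = {
    "price": -3,
    "heavy": -17,
    "craftsmanship": 2,
    "charges quickly": -9,
    "durable": -6,
}
top_three = top_categories(sample_scores)
print(top_three)
```

Sorting on the absolute value treats strongly praised and strongly criticised categories alike; the sign of each score then tells the designer which of the two applies.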


### 6.1 Sentiment Analysis

We will first use a sentiment analysis model to analyse the data and label each comment positive or negative.

We can then group the comments by the category they belong to and assign a score to each category
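
The scoring rule, plus one for every positive comment and minus one for every negative comment in a category, can be sketched with a pandas group-by. This is a sketch under assumptions: `category_scores` is a hypothetical helper, the sample rows are made up, and the `Sentiment` values are assumed to be the strings POSITIVE and NEGATIVE.

```python
import pandas as pd


def category_scores(df):
    """Sum +1 for POSITIVE and -1 for NEGATIVE comments within each label."""
    signed = df["Sentiment"].map({"POSITIVE": 1, "NEGATIVE": -1})
    return signed.groupby(df["label"]).sum().to_dict()


# Made-up labelled comments.
labelled = pd.DataFrame({
    "Comments": ["love the camera", "camera is blurry", "too pricey", "camera is sharp"],
    "label": ["camera", "camera", "price", "camera"],
    "Sentiment": ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"],
})
scores = category_scores(labelled)
print(scores)
```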

The function <b> sentiment_labeller(df, categories): </b> takes the dataframe of the csv created in the previous steps and the keywords as the categories. It then outputs a list of (category, score) pairs sorted by absolute score, a dictionary of all the positive things said in each category and a dictionary of all the negative things said in each category.

In [None]:
import string
#Sentiment analysis
from transformers import pipeline

import pandas as pd
import os
def read_csv(file_path):
    return pd.read_csv(file_path)

processed = read_csv('Data/Final/check_output.csv')


#This function takes a dataset and labels each comment in it as positive or negative; the comment is then sent to the matching word storage
#df is the dataframe to be labelled and categories is the processed keywords
def sentiment_labeller(df, categories):
    sentiment_pipeline = pipeline("sentiment-analysis")

    sentences = []
    for index, row in df.iterrows():
        sentence, category = row['Comments'], row['label']
        sentences.append(sentence)

    #Takes the labels positive and negative and adds it as a new column to the dataframe supplied
    new_list = list(map(lambda x: x['label'], sentiment_pipeline(sentences)))
    df['Sentiment'] = new_list
    
    #Now create the containers for the results
    #positive_word_storage holds the compliments in each category
    #negative_word_storage holds the complaints in each category
    #storage holds the score of each category
    storage = {}
    positive_word_storage = {}
    negative_word_storage = {}
    for x in categories:
        storage[x] = 0
        positive_word_storage[x] = ""
        negative_word_storage[x] = ""
        
    #iterate through all the rows
    for index, row in df.iterrows():
        comment = row['Comments']
        category = row['label']
        positiveNegative = row['Sentiment']
        active = None
        if category in categories:
            if positiveNegative == "POSITIVE":
                storage[category] += 1
                active = positive_word_storage
            else:
                storage[category] -= 1
                active = negative_word_storage
            comment = str(comment).strip()
            if not comment:
                continue  #nothing to store for an empty comment
            if comment[-1] in string.punctuation:
                active[category] = active[category] + comment
            else:
                active[category] = active[category] + comment + '.'
    #rank the categories once, after all rows are counted, by absolute score
    Qualatative = sorted(list(storage.items()), key= lambda x: abs(x[1]), reverse=True)
    return Qualatative, positive_word_storage, negative_word_storage
#Quantitative data analysed, now go find the largest numerical category and if they are good or bad

Qualatative, positive, negative = sentiment_labeller(processed,keywords)

#Done once for the iphone and once for the competition

This is the output for the iphone

In [None]:
Qualatative = [('heavy', -17),
 ('charges quickly', -9),
 ('durable', -6),
 ('lightweight', -5),
 ('charging speed', -4),
 ('comfortable to use', 4),
 ('price', -3),
 ('durability', -3),
 ('easy to use', -2),
 ('craftsmanship', 2),
 ('long battery life', 2),
 ('camera', -1),
 ('temperature control', -1),
 ('easy to set up', 0),
 ('water resistance', 0),
 ('wireless charging', 0)]

positive ={'heavy': "i bought a new iphone 14 in late december and my battery health is now 98%. i'm an average to heavy user so i can't decide if that's normal degradation or not. i have excellent screen-on time, usually around 7 or 8hrs until battery reaches 20%. anyone have some comments?",
 'price': 'Phone is worth it.Great Price on a Top-of-the-Line iPhone Pro!Good value for refurbished iphone.Great Option for a Reasonably Priced iPhone.Good value for refurbished iphone.Great Option for a Reasonably Priced iPhone.so i have a friend who sells his iphone 12 pro max and he told me if i want to give him my iphone 13 mini + 100$ to get the iphone 12 pro max, so i am really thinking about it, is it worth it.i think it\'s worth the $100 more if you\'re choosing which to buy, but if you have a iphone 13 then you should hold on to it.you know it\'s funny. apple is good at this but honestly it\'s not something i mind because most people get their phone through verizon or sprint or t-mobile etc. and those companies generally offer you an upgrade and nick like 700 dollars off getting a new phone in order to keep you for two more years. contract free of course but if you leave early you pay off the balance. so really the base iphone is 100 dollars and that\'s a pretty good deal honestly. just came up from a base 11. can\'t say i\'m disappointed. this base screen is much better and the battery is too, which is most of what i wanted.i got a new iphone 14 256gb for $700 for my dad. i hope that\'s a good deal lol.iphone x or 11 that can be had for less then $150 are still great phones. they also offer an almost identical user experience then the "latest and greatest".the best part about the iphone 14 is it makes the 13 cheaper.',
 'camera': "iphone 14 is incredible i love it the camera and cinematic mode is insane.watching on my iphone 14 plus in purple.i recently upgraded from the iphone 8 to the 13, and i could never go back the camera is just incredible and the speed of the phone is spectacular. definitely recommend the 13! but the 12 is also a great option too!@xyztuv its very nice, i like the camera and the overall functions. its a big upgrade from an iphone 6 to an iphone 14. :d personally, i don't feel the need to have more advanced camera settings. but if you're into photography and want better quality when it comes to taking pictures, you should go for the pro version.",
 'durable': 'yeah i knowfirst world problems but iphone 14 pro battery is just that good that i went from using lbm all the time on xs and 12 pm to never turning it on.im still using an iphone 6 and honestly im sticking wit it till it breaks. it had been through a lot and it still works and thats all that matters. so for those people that are upgrading like every year, its cool and amazing to have the latest phone but literally can do the same thing as the last phone you had. so my iphone 6 is still up n running and the camera quality is pretty good ngl, and also contacting family or friends is still the same thing. upgrading when my 6 broken broken lol.i had a iphone 8 for 6 years and i had to upgrade to a iphone 14. my iphone 14 is just perfect in my opinion.@laura warburton damn i miss my iphone 8, had it for years and it was a beast. loved the haptic home button.yes! im about to complete four years with my iphone 11 and the camera, the software, the battery are still great, i just regret not getting a bigger storage option but its amazing having that phone, my next goal is to get the pro version form the 13 or 14.people who still uses iphone 6s plus.',
 'durability': "hi,\ni'm currently using my iphone x that i bought back in december 2018 and the age is showing.bro i use an iphone se (old version) as my iphone 8 broke .. that makes me sound ungrateful but i am very grateful that i even have a phone!",
 'lightweight': 'when throwing the iphone 13: oh it was a light throw, nothing to worry.',
 'easy to use': '',
 'craftsmanship': 'iPhone renovado.aka the best iphone ever made.opinion: the iphone 13 pro and 13 pro max are better than the 14 and 14 plus because they have pro-motion, an extra camera, and a more premium build along with the same specs, and they can be gotten for slightly cheaper than they used to be.coming from an iphone 6s, this is amazing lol.',
 'easy to set up': '',
 'charging speed': '',
 'charges quickly': 'Yes I got my iPhone delivered today.',
 'water resistance': '',
 'wireless charging': '',
 'long battery life': "i read great reviews about the iphone 13 pro/pro max (camera, screen and also battery life)that's how apple makes us purchase higher storage iphone.i upgraded from the iphone x to the 14 and although i wish i went to the 14 pro, the 14 is still way faster, the camera is better, and the battery life has been solid all day!",
 'copmfortable to use': "Great iPhone.iPhone 13 Is A Great Choice Still.IPhone 13.iPhone 13 Pro Max.im upgrading from a 12 to this phone, not really an apple fan boy but to me i think the 14 will be just fine for my needs.maybe not so much of an upgrade coming from iphone 13. coming from an iphone xr this is a good upgrade! hahaha enjoying my iphone 14.ok i don't feel as bad as i replaced my iphone 12 promax for the iphone 14 promax.6gb ram is also quite an useful upgrade compared to the 4gb ram of iphone 13.",
 'temperature control': ''}

negative = {'heavy': 'how ios handles sound and sound silencing has always been so awful imo.never thought i\'d say this but ios is getting bloated.i\'ve been wanting this feature for so long and it\'s been on my ios wishlist for years.let\'s be honest, ios as a whole does the hardware that are modern iphones a giant disservice.same as an end user should not worry about installing drivers or editing registry entries.\nthe ballooning system storage suddenly taking up hundreds of gb for no reason is different, that seems to be an issue by apple, even though i haven\'t really heard of it anymore after ios 16 released.my biggest issue with my phone at the minute.\nmy mobile banking app is 90mb at download.yes, at 50% - i find it annoying that ios constantly interrupts when certain thresholds are being reached.although with my new iphone 14 pro max, i have never used it.apple sucks in general. i switched from a samsung s10e to apple one month ago and the iphone is a lot thicker, software options like eq or different keyboards suck because you need subscriptions for every shit. never again an iphone because they try to tie you up.that iphone 13 yeet hurt me.i\'m done with iphones.iphone will fall down soon.apple is a crap company, the iphone saved them from extinsion and now it\'s happening again lol.watching you throw that iphone 13 like that isn\'t very cool, a lot of people are watching these videos knowing they can\'t afford those products and dreaming about getting them at some point "one day". so that kind of materialistic "not caring" just doesn\'t fit well.apple hater.my heart dropped literally when he dropped that iphone 13 like it\'s nothing .iphone 11 is trash bro.honestly the iphone 13 was a super disappointing upgrade and this thing has a huge skip me printed on the box. what is going on here apple?!',
 'price': "hi all,\ni currently have an iphone 11 and i am looking to upgrade my phone as i have had it 3 years now\ni wanted the 13 pro but it's hard to get unless you pay for 1tb storage.\ni am now choosing between the standard 13 (around\n30pm) and the 14 pro (around 50pm)it will cost 65% of the rate of the iphone 13 new.based on the ios version, the storage likely isn't encrypted so you may be able to send it out for physical data recovery as well, it's just probably the priciest option.one big advantage of iphone 14 compared to 13 is the plus version, especially when in my country (cz) the 13 pro max 256 gb is over 200 usd more expensive compared to 128 gb 14 plus while 128 gb is more then enough (office use, no videos almost no photos) and of course 14 pro max is about 300 usd difference. the only drawback is older chip compared to 14 pro, but when they are at sunset, they both will be bad experience anyway and its likely they will be both unsupported in the same year anyway.apple sucks now, premium cost for old stuff that the competition does cheaper. fuek em.going from an iphone 7 to a 14.it seems like all iphones are gimmicks. always the same ios software and features. the cameras might get slightly better and the picture quality might get better but same battery and etc. i cant justify paying a premium for an iphone you for the apple name more than anything. i mean 1400$ for a 256gb iphone 14 lol you can buy an oled from samsung or sony for that same price and it'll probaly last you 100x longer. its just my opinion but no iphone has a 1400$ value. there not built to last but thats how marketing works they have to keep making the latest and greatest thing to keep buisness.tell me again how apple doesn't screw its users money-wise? the only thing i actually like from the iphone 14 is the fact that it can call emergency services in the event of an accident. 
other than that, there are far cheaper phones with more powerful components.my cousin just got a brand new iphone 14 yesterday i thought that's too expensive.i upgrade my iphone about every 3 or 4 years, i never bought the latest gen but always one or two gens older because price-performance is way better...<br>for instance, i got the 12 mini when the 13 series released, didn&#39;t regret it once and planning on using it until either the battery health gets significantly low or it doesn&#39;t get ios updates anymore... so it&#39;ll be probably another 2 to 3 years, making a total 3 to 4 years :)i would wait at least another year (iphone 15) i think for the price they're selling this at, it's not really worth the upgrade. the iphone 11 is a great phone.@arminperser a refurbished 128gb iphone x is like 600 bucks in turkey.basically an iphone 13c.the 14 was 949 in portugal these past days. the 13 was on sale too at 799. so normally between them without beeing on sale, in europe you wont find that big of a deal for a 13.. they cost almost the same, even a used one is like i&#39;m being robbed. so i went for the 14.<br>(also never owned an iphone before)hey marques, i'm due for an upgrade. with my current trade in, i can either get the iphone 14 or the 13 mini at no cost to me. any opinion on these options? i like the form factor of the 13 mini, but the smaller battery scares me a bit!",
 'camera': "right now she's using a 7 but wants to swap it to receive ios 16 features and a better camera and that worries me even more because chances are she'll also feel the same way with the x in not so long.the very unpleasant thing about iphone is that hp and others don't work with their video and picture files. so, if i want to make a photo book on my hp with the pictures made on iphone, i have to go through hell of a job of transforming the heic files to jpeg. and i am not going to use pictures made on samsung for photo books, because samsung does not invest enough money to make their cameras nearly as good as iphone cameras.i have an iphone 13 and the camera puts a filter after taking the picture. is for example an iphone 14 pro better?yep i'm upgrading from iphone 8 to 14 bc my 8 it at the end of its line - i'd keep using the 8 but my notifs don't work anymore and i can't fix them plus my camera is water damaged.your iphone supports the persecution of uyghurs!",
 'durable': "hi, i currently have the 2nd gen iphone se which has lasted pretty well for the last 3 years apart from the battery life which is appalling (needs recharging about 3x a day with my current use)i've been using a iphone 6s with a new battery for the past year.on my iphone before (maybe currently still.i'm on a 64gb iphone 8, and i still only use 40gb.i set it when i still used my iphone 7 with shit battery life and have just never got rid of it.this is why i upgrade every 3 or 4 years. i still use my iphone 2020 se & it does the job. ever since jobs passed away this is what there gonna continue to do. instead of trying to bring us something new, there gonna repackage the same old phone from the year before, but maybe a slightly better camera. whoopty whoop.i'm still packing the iphone 11. i wont see a new iphone for 5 more years as best. apple phones have become 100% stagnant.apples been doing this with the iphone for a decade lmao glad people are starting to figure out the scam.future proofed iphones don't exist and never will.i will buy an iphone 11 and upgrade to 14 after 3 years. then to 15 after another 3 years.lol literally no changes in the last few iphones. how pathetic and uninspired tim cook is....i'm still rocking my iphone 7. worked just fine until today. planning on getting the iphone 14 later today.",
 'durability': "my sister wants an xs max but i'm worried about the longevity.i used to use the files app back when i used an android all the time, and when i switched to ios i've used it a number of 2 or 3 times and got 0 use out of it.hey i own iphone 7 plus want a new secondary phone minimal usage but need more storage &amp; secure phone which can last at least ( years without issues preferably iphone which 1 ?<br>13 or 14 or 14 pro.why i use apple and the phone lifetime is very short. you have to mention that every 3 years you must change your iphone.<br>after i update the latest version for ios my iphone start have problems in calls and apple service reply change your phone.same here! my iphone 11 is starting to show its age (it's been laggy since the day i got it tbh) so i'm eyeing the iphone 14.",
 'lightweight': "Blue iPhone 12 mini.DOES NOT FIT ALL - JUST T-MOBILE.these iphones are so lame.he just threw the green iphone 14 in the air like its nothing.it's just an iphone 13 with a software update.little changes again for the base-line iphone! this is so disappointing. i was looking forward to some new features and changes but there wasn&#39;t anything of them. <br>and it&#39;s also disappointing to see apple having less passion to improve the base-line iphone, which is my favorite iphone line. almost same phones for 2 years in a row... this is too disappointing.",
 'easy to use': 'yoo bro you just drop the iphone like is nothing ,can you send me one.iphone air is like blow air from mouth what ever u think in ur head it is fully automatic.',
 'craftsmanship': "just got my iphone 14 product red- it's dope.iphone 25 pro max air no more iphone launching dates igurihollywood only changing designed in ca made in china.",
 'easy to set up': '',
 'charging speed': "Faulty charging port on phone.i've also set an automation for when iphone charges above 95%iphone 13 4gb ram. iphone 14 6gb ram...iphone 15 should come with a fast charge brick and air pod pros.",
 'charges quickly': "Teléfono prácticamente nuevo, sin rayones ni golpes.they been blowing up since ios 2.0.i just got iphone 14 plus it's hot is that bad it get warm.the little grimace as he chucked the iphone 13 over his shoulder.i went from a iphone 4s to a 5c, 5c to 5, 5 to 5s, 5s to 4, 4 to 6s, 6s to 7 plus, 7 plus to x.i ordered a iphone 14 today, can't wait to try all the features. :)i went from iphone 6 to iphone 13 so big upgrade for me too lol.they went downhill after the release of the iphone 11.went from a 7 plus running ios 10 to 13 pm last nov, had to figure out some things lol.the 13 just came out. why the fuck would apple release another phone just a few months later? apple is only doing cash grabs cause phones can only be so big or so small yet doing nothing. im not buying a new iphone till it's revolutionary. like the first iphone.",
 'water resistance': '',
 'wireless charging': '',
 'long battery life': 'because my iphone xr last 2 days like that.',
 'copmfortable to use': "what sold me on the iphone 14 was the crash detection since i was getting this for my mom and best have it and not need it than need it and not have it.yet for actual things that matter, ios is a fail. i.e. file system access, universal back button/gesture, notifications, walled software/hardware garden etc. android caught up and is just as smooth now.. apple hang onto this fluid thing but what is that exactly break it down.u got the same colour with my iphone 14.honestly, i only got it because my family switched to t-mobile, and it was free + tax. it was time for me to upgrade either way since i was rocking the iphone 11, so i don't hate it. in fact i like it a lot. i wanted the 13 anyways, they just didn't offer it as a free upgrade.",
 'temperature control': "but still, i'm gonna upgrade from iphone 8 into iphone 14 since i want a phone that easy to repair & better thermal so i won't find any fuss like i already have in iphone 8."}

### 6.2 Finding most highlighted issues (quantitative analysis)

The <b>get_outputs(Qualatative, positive, negative)</b> function takes the ranked list produced by the previous function and returns its top 3 categories as a list. For each of these categories, it checks whether the score is negative or positive and retrieves the corresponding collection of comments as a single string. These strings are returned in a second list, in the same order as the categories in the first.

In [None]:

def get_outputs(Qualatative, positive, negative):
    # Qualatative is a ranked list of (category, score) tuples; keep the top 3
    points_to_focus_on = Qualatative[:3]
    # points stores the categories to focus on
    # extracted_comments stores what has been said about each category
    points = []
    extracted_comments = []
    for category, score in points_to_focus_on:
        points.append(category)
        if score < 0:  # negative score, so use the negative word bank
            extracted_comments.append(negative[category])
        else:  # zero or positive score, so use the positive word bank
            extracted_comments.append(positive[category])
    return points, extracted_comments


product = 'iphone'
x,y = get_outputs(Qualatative, positive, negative)
print(x)

This is the output for the iPhone:

In [13]:
x= ['heavy', 'charges quickly', 'durable']
y = ['how ios handles sound and sound silencing has always been so awful imo.never thought i\'d say this but ios is getting bloated.i\'ve been wanting this feature for so long and it\'s been on my ios wishlist for years.let\'s be honest, ios as a whole does the hardware that are modern iphones a giant disservice.same as an end user should not worry about installing drivers or editing registry entries.\nthe ballooning system storage suddenly taking up hundreds of gb for no reason is different, that seems to be an issue by apple, even though i haven\'t really heard of it anymore after ios 16 released.my biggest issue with my phone at the minute.\nmy mobile banking app is 90mb at download.yes, at 50% - i find it annoying that ios constantly interrupts when certain thresholds are being reached.although with my new iphone 14 pro max, i have never used it.apple sucks in general. i switched from a samsung s10e to apple one month ago and the iphone is a lot thicker, software options like eq or different keyboards suck because you need subscriptions for every shit. never again an iphone because they try to tie you up.that iphone 13 yeet hurt me.i\'m done with iphones.iphone will fall down soon.apple is a crap company, the iphone saved them from extinsion and now it\'s happening again lol.watching you throw that iphone 13 like that isn\'t very cool, a lot of people are watching these videos knowing they can\'t afford those products and dreaming about getting them at some point "one day". so that kind of materialistic "not caring" just doesn\'t fit well.apple hater.my heart dropped literally when he dropped that iphone 13 like it\'s nothing .iphone 11 is trash bro.honestly the iphone 13 was a super disappointing upgrade and this thing has a huge skip me printed on the box. 
what is going on here apple?!', "Teléfono prácticamente nuevo, sin rayones ni golpes.they been blowing up since ios 2.0.i just got iphone 14 plus it's hot is that bad it get warm.the little grimace as he chucked the iphone 13 over his shoulder.i went from a iphone 4s to a 5c, 5c to 5, 5 to 5s, 5s to 4, 4 to 6s, 6s to 7 plus, 7 plus to x.i ordered a iphone 14 today, can't wait to try all the features. :)i went from iphone 6 to iphone 13 so big upgrade for me too lol.they went downhill after the release of the iphone 11.went from a 7 plus running ios 10 to 13 pm last nov, had to figure out some things lol.the 13 just came out. why the fuck would apple release another phone just a few months later? apple is only doing cash grabs cause phones can only be so big or so small yet doing nothing. im not buying a new iphone till it's revolutionary. like the first iphone.", "hi, i currently have the 2nd gen iphone se which has lasted pretty well for the last 3 years apart from the battery life which is appalling (needs recharging about 3x a day with my current use)i've been using a iphone 6s with a new battery for the past year.on my iphone before (maybe currently still.i'm on a 64gb iphone 8, and i still only use 40gb.i set it when i still used my iphone 7 with shit battery life and have just never got rid of it.this is why i upgrade every 3 or 4 years. i still use my iphone 2020 se & it does the job. ever since jobs passed away this is what there gonna continue to do. instead of trying to bring us something new, there gonna repackage the same old phone from the year before, but maybe a slightly better camera. whoopty whoop.i'm still packing the iphone 11. i wont see a new iphone for 5 more years as best. apple phones have become 100% stagnant.apples been doing this with the iphone for a decade lmao glad people are starting to figure out the scam.future proofed iphones don't exist and never will.i will buy an iphone 11 and upgrade to 14 after 3 years. 
then to 15 after another 3 years.lol literally no changes in the last few iphones. how pathetic and uninspired tim cook is....i'm still rocking my iphone 7. worked just fine until today. planning on getting the iphone 14 later today."]
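
To make the selection rule concrete, the following minimal sketch reruns the same logic on a small made-up ranking. The scores and comment strings are invented purely for illustration and are not part of the scraped dataset. `select_top_categories` is a hypothetical stand-in for `get_outputs` that additionally falls back to an empty string when a category is missing from a word bank.

```python
def select_top_categories(ranking, positive, negative, top_n=3):
    """Return the top_n category names and the matching comment strings.

    ranking is a list of (category, score) tuples, assumed to be already
    sorted so that the most highlighted categories come first.
    """
    points = []
    extracted_comments = []
    for category, score in ranking[:top_n]:
        points.append(category)
        # A negative score selects the negative word bank; zero or a
        # positive score selects the positive one.
        bank = negative if score < 0 else positive
        # .get avoids a KeyError if a category has no entry in the bank.
        extracted_comments.append(bank.get(category, ""))
    return points, extracted_comments


# Toy data, invented for illustration only.
toy_ranking = [("price", -3), ("camera", 2), ("durability", -1), ("heavy", 0)]
toy_positive = {"price": "good value.", "camera": "camera is incredible."}
toy_negative = {"price": "too expensive.", "durability": "broke after a year."}

toy_points, toy_comments = select_top_categories(toy_ranking, toy_positive, toy_negative)
print(toy_points)
print(toy_comments)
```

Only the first three entries of the ranking are used, so the fourth toy category (`heavy`) is ignored here.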

### 6.3 Getting suggestions for improvement based on comments and design prompt (qualitative analysis)

The <b>get_best_way_to_improve_quality(points, extracted_comments, product)</b> function takes the output of the previous function as well as the product name. For each category, it builds a prompt asking OpenAI to summarise the extracted comments and suggest the best way for the company to improve the product in that category.

The <b>get_design_problem_statement(product, points)</b> function takes the product name and the top 3 categories and uses them to prompt OpenAI to generate a design problem statement.

In [None]:
# This function takes the points and extracted comments and asks ChatGPT to summarise the best way to improve in each category
def get_best_way_to_improve_quality(points, extracted_comments, product):
    ways_to_improve = {}
    for i in range(len(points)):
        prompt = "According to the extracted comments, what is the best way for the company of the " + product + " to improve the following aspect of the " + product + "? Aspect: " + points[i] + ". Extracted Comments: " + extracted_comments[i]
        response = ask_gpt(prompt)
        print(response)
        ways_to_improve[points[i]] = response
    return ways_to_improve

responsed = get_best_way_to_improve_quality(x,y, product)
print(responsed)

# This function takes the points and uses OpenAI to create a design problem statement for the product
def get_design_problem_statement(product, points):
    # Separate the qualities with commas so they do not run together in the prompt
    pointers = ", ".join(points)
    prompt = "Create a Design problem statement to improve the " + product + " centering around the following qualities: " + pointers
    return ask_gpt(prompt)
    
response = get_design_problem_statement(product, x)

print(response)

This is the output for the iPhone:

In [None]:
responsed ={'heavy': 'The best way for the company of the iphone to improve the heaviness of the iphone is to reduce the size of the software and apps, and make sure that the hardware of the iphone is optimized to make it as lightweight as possible. Additionally, the company should focus on providing more features and options that do not require subscriptions, and should make sure that the software is not bloated with unnecessary features. Finally, the company should be more mindful of how their products are presented in videos and other media, as this can have a negative impact on people who cannot afford the products.', 'charges quickly': 'The best way for the company of the iphone to improve the charge time of the iphone is to focus on making the phone more efficient. This could include optimizing the battery, making the charging process faster, and improving the overall power management of the device. Additionally, the company should focus on making the phone more revolutionary, as customers are looking for something new and innovative.', 'durable': 'The best way for the company of the iphone to improve the durability of the iphone is to focus on creating more innovative and reliable products that are future-proofed and can last longer than 3-4 years. They should also focus on improving the battery life of their phones, as this is a major issue for many users. Additionally, they should strive to make meaningful changes to their phones with each new release, rather than simply repackaging the same old phone with a slightly better camera.'}
response ="Design a new iPhone that is durable, charges quickly, and is able to maintain its charge for longer periods of time. The design should also be able to withstand heavy usage and be able to charge quickly without compromising its durability."
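
Because every call to `ask_gpt` depends on an external service, it can help to check the prompt construction offline first. The sketch below is one possible way to do that, not the project's implementation: `build_improvement_prompt`, `collect_suggestions` and `fake_ask` are hypothetical names, and the function that queries the model is passed in as a parameter so that a stub can take its place.

```python
def build_improvement_prompt(product, aspect, comments):
    """Build the prompt text for a single aspect of the product."""
    return (
        "According to the extracted comments, what is the best way for the "
        f"company of the {product} to improve the following aspect of the "
        f"{product}? Aspect: {aspect}. Extracted Comments: {comments}"
    )


def collect_suggestions(points, extracted_comments, product, ask):
    """Call ask(prompt) once per aspect and return {aspect: response}."""
    suggestions = {}
    for aspect, comments in zip(points, extracted_comments):
        prompt = build_improvement_prompt(product, aspect, comments)
        suggestions[aspect] = ask(prompt)
    return suggestions


def fake_ask(prompt):
    # Offline stand-in for the real model call: it reports the prompt
    # length instead of contacting a language model.
    return f"(stub reply to a prompt of {len(prompt)} characters)"


# Toy input, invented for illustration only.
demo = collect_suggestions(["price"], ["too expensive."], "iphone", fake_ask)
print(demo)
```

Passing the project's own `ask_gpt` as the `ask` argument would send the same prompts to the model instead of the stub.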

## 7. Final Outputs and reflections

The final step is to take the responses and consolidate them into a single output in the form of a text file.

The following code takes the collected responses and creates the <b>output.txt</b> file.

In [None]:
with open('output.txt', 'w') as f:
    f.write('This is the output for the product: ')  
    f.write(product+ '\n\n\n') 
    
    f.write("Output one: Product's weakest areas" + '\n')
    f.write('According to the comments analysed, the product is weakest in the following categories: ' + '\n\n')
    
    
    # Join the category names into a single comma-separated line
    categories_to_improve = ', '.join(x)
            
    f.write(categories_to_improve + '\n\n')
    
    f.write('\n\n')

    
    f.write("Output two: Suggested means to improve the product's weakest areas:" + '\n')
    f.write('These are the suggestions for a company to improve in the following categories: ' + '\n\n')
    for j in responsed:
        f.write(j + " : " + responsed[j] + '\n\n' )
        
    f.write('\n\n')

    f.write("Output three: A design problem statement to start designers ideating on improve the product:" + '\n')
    f.write(response)
    

The following is the output for the iPhone:

This is the output for the product: iphone


Output one: Product's weakest areas
According to the comments analysed, the product is weakest in the following categories: 

heavy, charges quickly, durable



Output two: Suggested means to improve the product's weakest areas:
These are the suggestions for a company to improve in the following categories: 

heavy : The best way for the company of the iphone to improve the heaviness of the iphone is to reduce the size of the software and apps, and make sure that the hardware of the iphone is optimized to make it as lightweight as possible. Additionally, the company should focus on providing more features and options that do not require subscriptions, and should make sure that the software is not bloated with unnecessary features. Finally, the company should be more mindful of how their products are presented in videos and other media, as this can have a negative impact on people who cannot afford the products.

charges quickly : The best way for the company of the iphone to improve the charge time of the iphone is to focus on making the phone more efficient. This could include optimizing the battery, making the charging process faster, and improving the overall power management of the device. Additionally, the company should focus on making the phone more revolutionary, as customers are looking for something new and innovative.

durable : The best way for the company of the iphone to improve the durability of the iphone is to focus on creating more innovative and reliable products that are future-proofed and can last longer than 3-4 years. They should also focus on improving the battery life of their phones, as this is a major issue for many users. Additionally, they should strive to make meaningful changes to their phones with each new release, rather than simply repackaging the same old phone with a slightly better camera.



Output three: A design problem statement to start designers ideating on improve the product:
Design a new iPhone that is durable, charges quickly, and is able to maintain its charge for longer periods of time. The design should also be able to withstand heavy usage and be able to charge quickly without compromising its durability.
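
One way to keep the report layout easy to check is to assemble the text in a function that returns a string, and only then write it to disk. The sketch below illustrates this under simplified assumptions: `build_report` is a hypothetical helper, the headings are shortened, and the input values are invented rather than taken from the real run.

```python
import tempfile
from pathlib import Path


def build_report(product, categories, suggestions, problem_statement):
    """Assemble the three outputs into a single report string."""
    lines = [
        f"This is the output for the product: {product}",
        "",
        "Output one: Product's weakest areas",
        ", ".join(categories),
        "",
        "Output two: Suggested means to improve the product's weakest areas:",
    ]
    for category, suggestion in suggestions.items():
        lines.append(f"{category} : {suggestion}")
    lines += ["", "Output three: A design problem statement:", problem_statement]
    return "\n".join(lines) + "\n"


# Toy values, invented for illustration only.
report = build_report(
    "iphone",
    ["heavy", "durable"],
    {"heavy": "reduce weight.", "durable": "strengthen the frame."},
    "Design a lighter, more durable phone.",
)

# Write to a temporary directory so the demo does not overwrite output.txt.
demo_path = Path(tempfile.mkdtemp()) / "output_demo.txt"
demo_path.write_text(report, encoding="utf-8")
print(demo_path.read_text(encoding="utf-8"))
```

Separating the formatting from the file write means the report text can be inspected or tested without touching the file system.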

### Analysis and reflection

<b> 1. Time taken to run: </b> Each web scraper and processing script takes a significant amount of time to run. Running the code simultaneously on Colab and on several computers allowed the process to be completed in a somewhat timely manner: 36 hours on each of 2 computers, or 72 machine-hours in total (both running Colab, and one running VS Code as well).

<b> 2. Dataset size: </b> Inspecting the CSV files shows that, even though a valid output was produced, not all the data was labelled, and some platforms provided a much larger dataset than others. Running the code for longer would improve coverage, but only by adding to the already significant run time.

<b> 3. Limitation of crowdsourcing: </b> Although the information was collected from general-purpose platforms, the general consensus derived from it may not be fully accurate, since users may not always know what they want. The data is also not segmented by region, which could be critical for certain products.
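
The dataset-size observation in point 2 can be quantified with a short script that counts labelled and total rows per platform. The sketch below is only an illustration: the column names `platform` and `label`, the platform names and the sample rows are all assumptions made up for this example, and the project's actual CSV files may be laid out differently.

```python
import csv
import io
from collections import Counter


def label_coverage(csv_file):
    """Return {platform: (labelled_rows, total_rows)} for an open CSV file."""
    total = Counter()
    labelled = Counter()
    for row in csv.DictReader(csv_file):
        platform = row["platform"]
        total[platform] += 1
        # A row counts as labelled when its label column is not blank.
        if row["label"].strip():
            labelled[platform] += 1
    return {platform: (labelled[platform], total[platform]) for platform in total}


# In-memory sample, invented for illustration only.
sample = io.StringIO(
    "platform,comment,label\n"
    "reddit,battery is great,long battery life\n"
    "reddit,too pricey,\n"
    "youtube,love the camera,camera\n"
)
coverage = label_coverage(sample)
print(coverage)
```

For a real file, the same function could be called on the object returned by `open(path, newline="", encoding="utf-8")`.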