**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Sheena Patel
- Anna Chen
- Shreya Velagala
- Catherine Du
- Esther Liu

# Research Question

<b> How does the relationship between TikTok fashion trends and their popularity as Google search terms inform the future of those trends? </b> <br>

In Depth RQ: <br>
How can we predict the level of interest in various TikTok fashion trends across New York and California using TikTok trend interest score data from January 1st, 2020 to December 31st, 2023 using a model that inputs the fashion trend name, monthly time data, and region to predict an interest rating between 0 - 100 and an associated label of 'low' (interest score < 25), 'rising' (interest score < 50), 'popular' (interest score < 75), and 'trending (interest score > 75) for the specified input in 2024? <br>

This question aims to develop a predictive model that evaluates the popularity of TikTok fashion trends in different regions and times, using a quantifiable interest rating system. The focus on a state-by-state analysis allows for a detailed understanding of regional preferences and trends overtime. <br>

We specifically focus on New York and California because they are the top two most populous American states and have the highest TikTok usage in the United States. When gathering data from Google Trends, we had to select states that displayed enough search queries for us to use.<br>

Interest Score/Interest Over Time Definition:  <br>
The "interest score" on Google Trends represents the relative popularity of a search query in a
specific region and time frame. It is indexed from 0 to 100, where 100 signifies the peak popularity
for the term. This score does not indicate the absolute search volume but rather shows the search term's popularity relative to the highest point on the chart for the given region and time. A higher score means more people are searching for that particular term at that time, while a lower score indicates lesser interest. The data is useful for identifying trends and understanding how interest in certain topics changes over time. <br>

## Background and Prior Work

Our team is curious to understand how current fashion trends can be understood to predict the rating of a piece of clothing. We aim to create a prediction model that can take types of clothing and dates as input to predict the rating of a piece of clothing. We hope this model will have real-world application by possibly allowing companies to predict ratings for different clothing pieces based on GenZ fashion to potentially improve clothing purchase rates. Many datasets exist that contain information about women’s clothing including product descriptions, reviews, and ratings. However, we would like to incorporate TikTok trends into our dataset to use TikTok fashion trends as a data feature when predicting the rating for clothing. 

Currently, we have a few datasets including Amazon Women's Fashion Dataset<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) and Women's Ecommerce Clothing Reviews<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) that seem relevant to our project goals. Based on our initial research, our ideal dataset would include columns for clothing type, clothing description, clothing review, clothing rating, and a label for each row representing what type of TikTok trend the clothing item falls into based on the clothing description. Additionally, we found other projects that have approached similar problems. One example is this project: Amazon Women's Clothing Review<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). In the project, the researcher performs EDA by analyzing various trends in the sentiment for Amazon clothing reviews. We plan to incorporate a similar approach to our EDA by visualizing the trends in rating prediction, but analyzing prediction from the perspective of TikTok trends and views for different TikTok trends influencing rating. Another project that had a similar approach is Rating prediction<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). This project performs rating prediction similar to what we would like to do. My team plans to use this project as an example when creating our prediction model and also use a similar approach for EDA using TF-IDF to assign TikTok trends to different clothing descriptions. 

The first project<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) found that the main trigrams from the reviews fall into the positive sentiment categories. For instance, the research found that fit true size, run true size, fit just right, love love love, fit like glove, usual wear size, and every time wear were some of the most prominent trigrams. This research conducts more of a sentiment analysis to show that reviews and ratings are more positive in their dataset. The second project<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) performed EDA and tried to analyze what kinds of age groups make what kinds of reviews and ratings. The research project found that layering, lounge and swim clothing tend to have the best reviews. The research also interestingly found that higher age groups had worse ratings. What we found interesting about this project is that TF-IDF Vectorization was used to convert the clothing descriptions into vectors to be passed as input to the prediction model. I think we are going to have to take a similar approach to our project.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Jaewook. (2022, October 13). Amazon reviews on Women Dresses. Kaggle. https://www.kaggle.com/code/jaewook704/amazon-reviews-on-women-dresses
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Agarap, Abien Fred, and Nicapotato. (2018, Feburary 3). Women’s e-Commerce Clothing Reviews. Kaggle. https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Matteoanzano. (2023, December 5). Customer review analysis with text mining. Kaggle. https://www.kaggle.com/code/matteoanzano111/customer-review-analysis-with-text-mining/input

# Hypothesis


These are a few possible hypotheses under exploration:
- TikTok trends that were more popular towards the end of quarantine (2019/2020), which we have data for, will be less popular in 2024 
- TikTok trends that were less popular in the end of quarantine (2019/2020), which we have data for, may be more popular in 2024. 
- TikTok trends that involve more seasonal clothing like festive clothing for Christmas or summer clothing like dresses and skirts will be popular in their respective seasons in 2024. For instance, skirts which are summer clothing pieces are more likley to be popular in summer 2024, while sweaters which are winter clothing items, are more likely to be popular in winter 2024. 
    - Trends with bright colors are more likely to have higher interest scores in the summer than in other seasons. 
    - Trends with knit material will have higher interest scores in winter and cotton material will have higher interest scores in summer.
    - Trends with more blouses will have higher interest scores in the summer and sweaters will have higher interest scores in winter. 
- TikTok trends that are considered to be the most popular of all time will continue to be very popular and have high interest scores throughout 2024. 
- TikTok trends that have been endorsed by celebrities and thus have had high interest scores in the past will continue to have high interest scores in 2024. 


# Data

## Data overview

We compiled a list of the top 30 TikTok trends from 1/1/2020 through 12/31/2023 using ChatGPT and external sources <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) and used Google Trends to search up these trend names and understand how the interest score rating changed overtime for them. There, we observed the data for “Interest over time”—which represents search interest (a value of 0-100 to represent peak popularity) relative to the highest point on the chart for the given region (the United States) and time (our aforementioned time frame)---where users searching for this trend name also searched with related queries. We collected 30 datasets by searching up 30 different top TikTok fashion trends on Google Trends to form our overall dataset of ~6500 datapoints. <br>


We describe the datasets below. We will be combining all of these datasets using pd.concat and then cleaning all the merged data together. The datasets will be stacked in rows along axis = 1 and then cleaned. The code for cleaning and combining the datasets is below the dataset descriptions. We use the [Pytrends API](https://pypi.org/project/pytrends/) to gather each dataset in a for loop into a separate data structure and then combine all the datasets using a dataframe at the very end to be cleaned. <br>

Data Cleaning Procedure: <br>
The data that we began with was relatively tidy because we made sure to choose states ('CA' and 'NY) which have higher populations and higher TikTok usage so there were no null values for the interest score for the fashion trends we searched up. We initially also chose Texas as a state in our research, but Texas gave us nearly 3000 null values because the fashion trends were not searched enough within the time frame we chose to actually have interest scores on Google Trends, so we removed Texas from our analysis. We also did some pre-processing steps where for each TikTok trend we found, we searched up the keyword on Google Trends and saw whether the interest score graphs had an updward, downward, or rising and falling trend existed for the interest score on Google Trends. We removed all fashion trends that lacked a relationship between interest score overtime. We also removed all fashion trends that gave an error saying not enough values to be displayed. The main data cleaning we had to do involved merging the different datasets together and cleaning the data to be appropriate to pass into a Machine Learning model for prediction. Here are the various data cleaning steps we took: <br>
1. Merge all 30 datasets together 
2. Use .unique() to make sure all 30 trends were added to the dataframe  
3. Check if any null values exist, remove all data for fashion trends with null values 
4. Make sure we have >100 observations for each fashion trend, if not, expand the timeframe 
5. Rename the columns to be more clean 
6. Apply a function to make a column that transforms each interest score observation into a label: 'low', 'rising', 'popular', 'trending' so that these labels can be predicted along with the interest score 
7. Make sure labels have been assigned correctly 
8. Create a new column that converts dates to numeric months for the ML model 
9. Perform one hot encoding for the trends to be used for the ML model 
10. Remove the 'date' and 'trend' columns because they are no longer needed

<br>
<br>




1. <a name="cite_note-1"></a> [^](#cite_ref-1) Howell, Carolyn. "TikTok Fashion Trends 2023: Unveiling the Hottest Styles." High Social, 31 Aug. 2023, www.highsocial.com/resources/tiktok-fashion-trends-2023-unveiling-the-hottest-styles/.
2. <a name="cite_note-2"></a> [^](#cite_ref-2) "Top TikTok Fashion Trends of 2023 (So Far)." Sweety High, www.sweetyhigh.com/read/top-tiktok-fashion-trends-2023-040323. Accessed 23 Feb. 2024.



Before creating the datasets, we manually checked the distributions generated by google trends for each trend to check if there was enough data and to see if any of them has cyclic patterns. The google docs here contains the distribution of all the trends we ended up using: [Trend Distributions](https://docs.google.com/document/d/13n-aXE2DKkp8noNjx5u7LjmYcoc4Xv3isw8XZL0vnSA/edit)

<b> Dataset #1: Y2K </b><br>
Keyword: Y2K -> Keyword for one of the top TikTok Fashion Trends between 2019 and 2023<br>
Dataset Name: Y2K <br>
Link to the dataset: [Google Trends Keyword Y2K](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=y2k&hl=en)<br>
Number of observations: 209<br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))<br>
Description: The dataset, gathered using Pytrends API by searching the keyword 'Y2K' on Google Trends, is structured in a dataframe. It includes metrics like interest scores over time, with data types encompassing datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is required for converting datetime to integer month values and creating columns for trends using one-hot encoding. This preprocessing is essential for feeding the trends and months into a Machine Learning prediction model. Additionally, interest scores will be categorized into labels such as 'low', 'rising', 'popular', or 'trending' for model output. <br>


<b> Dataset #2: Cottagecore </b>

Keyword: Cottagecore -> Keyword for one of the top TikTok Fashion Trends between 2019 and 2023<br>
Dataset Name: Cottagecore<br>
Link to the dataset: [Google Trends Keyword Cottagecore](https://www.google.com/url?q=https://trends.google.com/trends/explore?date%3D2020-01-01%25202023-12-31%26geo%3DUS%26q%3Dcottagecore%26hl%3Den&sa=D&source=docs&ust=1708745400653624&usg=AOvVaw0LOapYV77MsRKuz7RaC-Y8)<br>
Number of observations: 209<br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The dataset, gathered using Pytrends API by searching the keyword 'Cottagecore' on Google Trends, is structured in a dataframe. It includes metrics like interest scores over time, with data types encompassing datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is required for converting datetime to integer month values and creating columns for trends using one-hot encoding. This preprocessing is essential for feeding the trends and months into a Machine Learning prediction model. Additionally, interest scores will be categorized into labels such as 'low', 'rising', 'popular', or 'trending' for model output. <br>


<b> Dataset #3: E-Girl </b>

Keyword: E-Girl -> Keyword for one of the top TikTok Fashion Trends between 2019 and 2023<br>
Dataset Name: E-Girl<br>
Link to the dataset: [Google Trends Keyword E-Girl](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=e-girl&hl=en)<br>
Number of observations: 209<br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))<br>
Description: The dataset, gathered using Pytrends API by searching the keyword 'E-Girl' on Google Trends, is structured in a dataframe. It includes metrics like interest scores over time, with data types encompassing datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is required for converting datetime to integer month values and creating columns for trends using one-hot encoding. This preprocessing is essential for feeding the trends and months into a Machine Learning prediction model. Additionally, interest scores will be categorized into labels such as 'low', 'rising', 'popular', or 'trending' for model output. <br>


<b> Dataset #4: Vintage Thrift </b>

Keyword: Vintage Thrift  -> Keyword for one of the top TikTok Fashion Trends between 2019 and 2023<br>
Dataset Name: Vintage Thrift <br>
Link to the dataset: [Google Trends Keyword Vintage Thrift](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=vintage%20thrift&hl=en)<br>
Number of observations: 209<br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))<br>
Description: The dataset, gathered using Pytrends API by searching the keyword 'Vintage Thrift' on Google Trends, is structured in a dataframe. It includes metrics like interest scores over time, with data types encompassing datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is required for converting datetime to integer month values and creating columns for trends using one-hot encoding. This preprocessing is essential for feeding the trends and months into a Machine Learning prediction model. Additionally, interest scores will be categorized into labels such as 'low', 'rising', 'popular', or 'trending' for model output. <br>


<b> Dataset #5: Fairycore  </b>

Keyword: Fairycore -> Keyword for one of the top TikTok Fashion Trends between 2019 and 2023<br>
Dataset Name: Fairycore<br>
Link to the dataset: [Google Trends Keyword Fairycore](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=fairycore&hl=en)<br>
Number of observations: 209<br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The dataset, gathered using Pytrends API by searching the keyword 'Fairycore' on Google Trends, is structured in a dataframe. It includes metrics like interest scores over time, with data types encompassing datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is required for converting datetime to integer month values and creating columns for trends using one-hot encoding. This preprocessing is essential for feeding the trends and months into a Machine Learning prediction model. Additionally, interest scores will be categorized into labels such as 'low', 'rising', 'popular', or 'trending' for model output. <br>


<b> Dataset #6: Vanilla Girl </b>

Keyword: Vanilla Girl -> Keyword for one of the top TikTok Fashion Trends between 2019 and 2023<br>
Dataset Name: Vanilla Girl <br> 
Link to the dataset: [Google Trends Keyword Vanilla Girl](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=vanilla%20girl&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California)) <br>
Description: The dataset, gathered using Pytrends API by searching the keyword 'Vanilla Girl' on Google Trends, is structured in a dataframe. It includes metrics like interest scores over time, with data types encompassing datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is required for converting datetime to integer month values and creating columns for trends using one-hot encoding. This preprocessing is essential for feeding the trends and months into a Machine Learning prediction model. Additionally, interest scores will be categorized into labels such as 'low', 'rising', 'popular', or 'trending' for model output. <br>


<b> Dataset #7: Clean Girl Aesthetic </b>

Keyword: Clean Girl Aesthetic -> Keyword for one of the top TikTok Fashion Trends between 2019 and 2023 <br>
Dataset Name: Clean Girl Aesthetic <br>
Link to the dataset: [Google Trends Keyword Clean Girl Aesthetic](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=clean%20girl%20aesthetic&hl=en)  <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The dataset, sourced using Pytrends API by searching 'Clean Girl Aesthetic' on Google Trends, is formatted as a dataframe. It features metrics such as interest scores over time, including datetime, integer scores, and regional strings ('CA', 'NY'). It requires data cleaning for transforming datetime into integer month values and for the creation of trend columns using one-hot encoding. This process is vital for integrating the trends and months into a Machine Learning prediction model. Further, the interest scores will be labeled as 'low', 'rising', 'popular', or 'trending' for the model's output. <br>


<b> Dataset #8: Blokecore  </b>

Keyword: Blokecore -> Keyword for one of the top TikTok Fashion Trends between 2019 and 2023
Dataset Name: Blokecore <br>
Link to the dataset: [Google Trends Keyword Blokecore](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=clean%20girl%20aesthetic&hl=en)   <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The dataset, compiled using the Pytrends API with 'Blokecore' as the search keyword on Google Trends, is presented in a dataframe format. It includes metrics such as interest scores over time, consisting of datetime, integer scores, and regional strings ('CA', 'NY'). Necessary data cleaning involves converting datetime to integer month values and adding columns for trends via one-hot encoding. This preparation is crucial for incorporating the trends and months into a Machine Learning prediction model. Interest scores will also be classified under labels like 'low', 'rising', 'popular', or 'trending' for the model's output. <br>


<b> Dataset #9: Barbie Challenge </b>

Keyword: Barbie Challenge -> Top TikTok Fashion Trend between 2019 and 2023 <br>
Dataset Name: Barbie Challenge <br>
Link to the dataset: [Google Trends Keyword Barbie Challenge](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=barbie%20challenge&hl=en)   <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: This dataset, sourced via Pytrends API with 'Barbie Challenge' from Google Trends, is in a dataframe format. It includes interest scores over time, featuring datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning involves converting datetime to integer month values and creating trend columns using one-hot encoding for integration into a Machine Learning prediction model. Interest scores are labeled as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #10: Shirt Jackets </b>

Keyword: Shirt Jackets -> Top TikTok Fashion Trend between 2019 and 2023 <br>
Dataset Name: Shirt Jackets <br>
Link to the dataset: [Google Trends Keyword Shirt Jackets](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=shirt%20jackets&hl=en)  <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: Gathered using the Pytrends API, the 'Shirt Jackets' dataset from Google Trends is in dataframe form. It contains metrics like interest scores, datetime, integer scores, and regional strings ('CA', 'NY'). The data requires cleaning for datetime conversion into integer months and for trend columns addition via one-hot encoding, essential for Machine Learning prediction model input. Interest scores will be classified as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #11: Balletcore </b>

Keyword: Balletcore -> Top TikTok Fashion Trend between 2019 and 2023 <br>
Dataset Name: Balletcore <br>
Link to the dataset: [Google Trends Keyword Balletcore](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Balletcore&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Balletcore dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #12: Coastal Grandmother </b>

Keyword: Coastal Grandmother -> Top TikTok Fashion Trend between 2019 and 2023 <br>
Dataset Name: Coastal Grandmother <br>
Link to the dataset: [Google Trends Keyword Coastal Grandmother](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Coastal%20Grandmother&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Coastal Grandmother dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #13: Gingham </b>

Keyword: Gingham -> Top TikTok Fashion Trend between 2019 and 2023 <br>
Dataset Name: Gingham <br>
Link to the dataset: [Google Trends Keyword Gingham](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Gingham&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Gingham dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #14: Maxi Skirts </b>

Keyword: Maxi Skirts -> Top TikTok Fashion Trend between 2019 and 2023 <br>
Dataset Name: Maxi Skirts <br>
Link to the dataset: [Google Trends Keyword Maxi Skirts](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Maxi%20Skirts&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Maxi Skirts dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #15: Corset </b>

Keyword: Corset  <br>
Dataset Name: Corset <br>
Link to the dataset: [Google Trends Keyword Corset](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Corset&hl=en)  <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Corset dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #16: Leg Warmers </b>

Keyword: Leg Warmers <br>
Dataset Name: Leg Warmers <br>
Link to the dataset: [Google Trends Keyword Leg Warmers](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Leg%20Warmers&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Leg Warmers dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #17: Birkenstocks </b>

Keyword: Birkenstocks  <br>
Dataset Name: Birkenstocks <br>
Link to the dataset: [Google Trends Keyword Birkenstocks](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Birkenstocks&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Birkenstocks dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #18: Cloud Slides </b> <br>

Keyword: Cloud Slides <br>
Dataset Name: Cloud Slides <br>
Link to the dataset: [Google Trends Keyword Cloud Slides](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Cloud%20Slides&hl=en) <br>
Number of observations: 209  <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Cloud Slides dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #19: Leather </b>

Keyword: Leather <br>
Dataset Name: Leather  <br>
Link to the dataset: [Google Trends Keyword Leather](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Leather&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Leather dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #20: Funky Pants </b>

Keyword: Funky Pants <br>
Dataset Name: Funky Pants <br>
Link to the dataset: [Google Trends Keyword Funky Pants](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Funky%20Pants&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Funky Pants dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #21: Sweater Vests </b>

Keyword: Sweater Vests <br>
Dataset Name: Sweater Vests <br>
Link to the dataset: [Google Trends Keyword Sweater Vests]() <br>
Number of observations: 209  <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Sweater%20Vests&hl=enYork, California))

Description: The Sweater Vests dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #22: Linen Pants </b>

Keyword: Linen Pants <br>
Dataset Name: Linen Pants <br>
Link to the dataset: [Google Trends Keyword Linen Pants](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Linen%20Pants&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Linen Pants dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #23: Tube Tops </b>

Keyword: Tube Tops <br>
Dataset Name: Tube Tops <br>
Link to the dataset: [Google Trends Keyword Tube Tops](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Tube%20Tops&hl=en)  <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Tube Tops dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #24: Baggy pants </b>

Keyword: Baggy pants <br>
Dataset Name: Baggy pants  <br>
Link to the dataset: [Google Trends Keyword Baggy pants](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Baggy%20pants&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Baggy pants dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #25: Low-rise </b>

Keyword: Low-rise <br>
Dataset Name: Low-rise <br> 
Link to the dataset: [Google Trends Keyword Low-rise](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Low-rise&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Low-rise dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #26: Crochet  </b>

Keyword: Crochet <br>
Dataset Name: Crochet <br>
Link to the dataset: [Google Trends Keyword Crochet](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Crochet&hl=en) <br>
Number of observations: 209  <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Crochet dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #27: Platform Sandals </b>

Keyword: Platform Sandals <br>
Dataset Name: Platform Sandals <br>
Link to the dataset: [Google Trends Keyword Platform Sandals](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Platform%20Sandals&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Platform Sandals dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #28: Tomato Girl </b>

Keyword: Tomato Girl <br>
Dataset Name: Tomato Girl <br>
Link to the dataset: [Google Trends Keyword Tomato Girl](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Tomato%20Girl&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Tomato Girl dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #29: Soft Girl Aesthetic </b>

Keyword: Soft Girl Aesthetic <br>
Dataset Name: Soft Girl Aesthetic <br>
Link to the dataset: [Google Trends Keyword Soft Girl Aesthetic](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Soft%20Girl&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Soft Girl Aesthetic dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>


<b> Dataset #30: Mermaid Core </b>

Keyword: Mermaid Core <br>
Dataset Name: Mermaid Core <br>
Link to the dataset: [Google Trends Keyword Mermaid Core](https://trends.google.com/trends/explore?date=2020-01-01%202023-12-31&geo=US&q=Mermaid%20Core&hl=en) <br>
Number of observations: 209 <br>
Number of variables: 2 (Interest Score, Trend, 2 regions (New York, California))
<br>
Description: The Mermaid Core dataset, obtained through Pytrends API from Google Trends, is structured in a dataframe. It tracks interest scores over time and includes datetime, integer scores, and regional strings ('CA', 'NY'). Data cleaning is necessary to transform datetime into integer months and to create trend-specific columns through one-hot encoding. This process aids in preparing the data for a Machine Learning prediction model. Interest scores are to be categorized as 'low', 'rising', 'popular', or 'trending'. <br>

In [1]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION
%pip install pytrends
%pip install matplotlib

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Relevant Imports
from pytrends.request import TrendReq
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# List of Top 30 TikTok trends that we will iterate through
TikTokTrends = ["Y2K", "Cottagecore", "E-Girl", "Vintage Thrift", "Fairycore", "Vanilla Girl",
                "Clean Girl Aesthetic", "Blokecore", "Barbie Challenge", "Shirt Jackets", "Balletcore",
               "Coastal Grandmother", "Gingham", "Maxi Skirts", "Corset", "Leg Warmers", "Birkenstocks", "Cloud Slides",
               "Leather", "Funky Pants", "Sweater Vests", "Linen Pants", "Tube Tops", "Baggy pants", "Low-rise", "Crochet",
               "Platform Sandals", "Tomato Girl", "Soft Girl Aesthetic", "Mermaid Core"]
# Multiple dataframes where one dataframe per trend will be in this list
df_per_trend = []

In [4]:
# For loop that iterates through all top 27 TikTok Trends from 2020 to 2023 and populates a
# list (df_per_trend) with interest scores for each trend.
for trend in TikTokTrends:
    # google trends df
    # Initialize pytrends
    pytrends = TrendReq(hl='en-US', tz=360)

    # Define the keyword and timeframe
    kw_list = []
    kw_list.append(trend)
    print(kw_list)
    timeframe = '2020-01-01 2023-12-31'

    # Define geographic locations
    geo_locations = ['US-CA', 'US-NY']  # California, New York, DEFAULT (ALL of US when state not specified)

    # Dictionary to hold data
    trends_data = {}

    # Fetching the data for each location
    for geo in geo_locations:
        pytrends.build_payload(kw_list, cat=0, timeframe=timeframe, geo=geo, gprop='')
        data = pytrends.interest_over_time()
        if not data.empty:
            trends_data[geo] = data[trend]

    # Combine data from different regions into one DataFrame
    df_curr_trend = pd.concat(trends_data, axis=1)

    df_curr_trend = df_curr_trend.reset_index()
    df_curr_trend["trend"] = trend
    df_per_trend.append(df_curr_trend)


['Y2K']
['Cottagecore']
['E-Girl']
['Vintage Thrift']
['Fairycore']
['Vanilla Girl']
['Clean Girl Aesthetic']
['Blokecore']
['Barbie Challenge']
['Shirt Jackets']
['Balletcore']
['Coastal Grandmother']
['Gingham']
['Maxi Skirts']
['Corset']
['Leg Warmers']


ReadTimeout: HTTPSConnectionPool(host='trends.google.com', port=443): Read timed out. (read timeout=2)

In [None]:
# Merging dataframes for all trends together
df_all_trends = pd.concat(df_per_trend, ignore_index=True)

In [None]:
na = df_all_trends['US-CA'].isna().any()
# False
na = df_all_trends['US-NY'].isna().any()
# False
print(na)
print(df_all_trends['US-CA'].dtypes)
# int64
print(df_all_trends['US-NY'].dtypes)
# int64

False
int64
int64


In [None]:
df_all_trends.head()

Unnamed: 0,date,US-CA,US-NY,trend
0,2020-01-05,13,17,Y2K
1,2020-01-12,15,12,Y2K
2,2020-01-19,11,15,Y2K
3,2020-01-26,9,17,Y2K
4,2020-02-02,8,20,Y2K


In [None]:
len(df_all_trends)
'''
5643
'''
df_all_trends["trend"].value_counts()
'''
Y2K                     209
Corset                  209
Crochet                 209
Low-rise                209
Baggy pants             209
Tube Tops               209
Linen Pants             209
Sweater Vests           209
Funky Pants             209
Leather                 209
Cloud Slides            209
Birkenstocks            209
Leg Warmers             209
Maxi Skirts             209
Cottagecore             209
Gingham                 209
Coastal Grandmother     209
Balletcore              209
Shirt Jackets           209
Barbie Challenge        209
Blokecore               209
Clean Girl Aesthetic    209
Vanilla Girl            209
Fairycore               209
Vintage Thrift          209
E-Girl                  209
Platform Sandals        209
'''

'\nY2K                     209\nCorset                  209\nCrochet                 209\nLow-rise                209\nBaggy pants             209\nTube Tops               209\nLinen Pants             209\nSweater Vests           209\nFunky Pants             209\nLeather                 209\nCloud Slides            209\nBirkenstocks            209\nLeg Warmers             209\nMaxi Skirts             209\nCottagecore             209\nGingham                 209\nCoastal Grandmother     209\nBalletcore              209\nShirt Jackets           209\nBarbie Challenge        209\nBlokecore               209\nClean Girl Aesthetic    209\nVanilla Girl            209\nFairycore               209\nVintage Thrift          209\nE-Girl                  209\nPlatform Sandals        209\n'

In [None]:
# Renaming Columns
df_all_trends.rename(columns={'US-CA': 'US-CA_Interest_Score', 'US-NY': 'US-NY_Interest_Score'}, inplace=True)
df_all_trends.head()

Unnamed: 0,date,US-CA_Interest_Score,US-NY_Interest_Score,trend
0,2020-01-05,13,17,Y2K
1,2020-01-12,15,12,Y2K
2,2020-01-19,11,15,Y2K
3,2020-01-26,9,17,Y2K
4,2020-02-02,8,20,Y2K


In [None]:
# Data cleaning to assign labels 'low', 'rising', 'popular', and 'trending' to each datapoint.
# This will be the output ofr the prediction
def bin_interest_score(num):
    if num < 25:
        return 'low'
    elif num < 50:
        return 'rising'
    elif num < 75:
        return 'popular'
    else:
        return 'trending'

df_all_trends['US-CA_Interest_Label'] = df_all_trends['US-CA_Interest_Score'].apply(bin_interest_score)
df_all_trends['US-NY_Interest_Label'] = df_all_trends['US-NY_Interest_Score'].apply(bin_interest_score)

In [None]:
df_all_trends.iloc[5200:5600]

Unnamed: 0,date,US-CA_Interest_Score,US-NY_Interest_Score,trend,US-CA_Interest_Label,US-NY_Interest_Label
5200,2023-07-16,32,13,Low-rise,rising,low
5201,2023-07-23,0,0,Low-rise,low,low
5202,2023-07-30,33,15,Low-rise,rising,low
5203,2023-08-06,80,0,Low-rise,trending,low
5204,2023-08-13,26,13,Low-rise,rising,low
...,...,...,...,...,...,...
5595,2023-02-05,29,32,Platform Sandals,rising,rising
5596,2023-02-12,42,28,Platform Sandals,rising,rising
5597,2023-02-19,45,37,Platform Sandals,rising,rising
5598,2023-02-26,35,55,Platform Sandals,rising,popular


In [None]:
# Extract the month and create a new column to be a column of numeric months
df_all_trends['date'] = pd.to_datetime(df_all_trends['date'])
df_all_trends.insert(1, 'month', df_all_trends['date'].dt.month)
df_all_trends.head()

Unnamed: 0,date,month,US-CA_Interest_Score,US-NY_Interest_Score,trend,US-CA_Interest_Label,US-NY_Interest_Label
0,2020-01-05,1,13,17,Y2K,low,low
1,2020-01-12,1,15,12,Y2K,low,low
2,2020-01-19,1,11,15,Y2K,low,low
3,2020-01-26,1,9,17,Y2K,low,low
4,2020-02-02,2,8,20,Y2K,low,low


In [None]:
# One Hot Encoding for Trends
TikTokTrends = ["Y2K", "Cottagecore", "E-Girl", "Vintage Thrift", "Fairycore", "Vanilla Girl",
                "Clean Girl Aesthetic", "Blokecore", "Barbie Challenge", "Shirt Jackets", "Balletcore",
               "Coastal Grandmother", "Gingham", "Maxi Skirts", "Corset", "Leg Warmers", "Birkenstocks", "Cloud Slides",
               "Leather", "Funky Pants", "Sweater Vests", "Linen Pants", "Tube Tops", "Baggy pants", "Low-rise", "Crochet",
               "Platform Sandals", "Tomato Girl", "Soft Girl Aesthetic", "Mermaid Core"]

for trend in TikTokTrends:
    df_all_trends[trend] = [1 if value == trend else 0 for value in df_all_trends['trend']]

In [None]:
df_all_trends.head()

Unnamed: 0,date,month,US-CA_Interest_Score,US-NY_Interest_Score,trend,US-CA_Interest_Label,US-NY_Interest_Label,Y2K,Cottagecore,E-Girl,...,Sweater Vests,Linen Pants,Tube Tops,Baggy pants,Low-rise,Crochet,Platform Sandals,Tomato Girl,Soft Girl Aesthetic,Mermaid Core
0,2020-01-05,1,13,17,Y2K,low,low,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2020-01-12,1,15,12,Y2K,low,low,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2020-01-19,1,11,15,Y2K,low,low,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2020-01-26,1,9,17,Y2K,low,low,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2020-02-02,2,8,20,Y2K,low,low,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_all_trends.columns

Index(['date', 'month', 'US-CA_Interest_Score', 'US-NY_Interest_Score',
       'trend', 'US-CA_Interest_Label', 'US-NY_Interest_Label', 'Y2K',
       'Cottagecore', 'E-Girl', 'Vintage Thrift', 'Fairycore', 'Vanilla Girl',
       'Clean Girl Aesthetic', 'Blokecore', 'Barbie Challenge',
       'Shirt Jackets', 'Balletcore', 'Coastal Grandmother', 'Gingham',
       'Maxi Skirts', 'Corset', 'Leg Warmers', 'Birkenstocks', 'Cloud Slides',
       'Leather', 'Funky Pants', 'Sweater Vests', 'Linen Pants', 'Tube Tops',
       'Baggy pants', 'Low-rise', 'Crochet', 'Platform Sandals', 'Tomato Girl',
       'Soft Girl Aesthetic', 'Mermaid Core'],
      dtype='object')

In [None]:
# Drop date and trend columns
df_all_trends = df_all_trends.drop(['date', 'trend'], axis=1)

In [None]:
df_all_trends.columns

Index(['month', 'US-CA_Interest_Score', 'US-NY_Interest_Score',
       'US-CA_Interest_Label', 'US-NY_Interest_Label', 'Y2K', 'Cottagecore',
       'E-Girl', 'Vintage Thrift', 'Fairycore', 'Vanilla Girl',
       'Clean Girl Aesthetic', 'Blokecore', 'Barbie Challenge',
       'Shirt Jackets', 'Balletcore', 'Coastal Grandmother', 'Gingham',
       'Maxi Skirts', 'Corset', 'Leg Warmers', 'Birkenstocks', 'Cloud Slides',
       'Leather', 'Funky Pants', 'Sweater Vests', 'Linen Pants', 'Tube Tops',
       'Baggy pants', 'Low-rise', 'Crochet', 'Platform Sandals', 'Tomato Girl',
       'Soft Girl Aesthetic', 'Mermaid Core'],
      dtype='object')

In [None]:
df_all_trends.describe()

NameError: name 'df_all_trends' is not defined

# Ethics & Privacy

Are there any biases/privacy/terms of use issues with the data you propsed?
- Our data from google trends doesn't contain privacy concerns for the people they collected the data from, as it is a publicly avaliable resource that does not release any info on users. However there are biases in regard to the time of the data we gathered. Though TikTok was released in 2016, the data we are utilizing is only from 2020, which would show results that are highly catered towards recent effects of TikTok, rather than its overall lifespan.

Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?) How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
-  The 2 locations we are looking into are CA and NY, which both have large populations and have high Tiktok usage. It's important to recognize that our dataset represents users who actively engage with TikTok and have consistent internet access, so it does not account for places where trends have little effect. While acknowledging the influence of additional variables such as zoning and access to clothing stores on sales, we hope that our data could provide valuable suggestions for companies aiming to align their clothing offerings with popular social media trends.

How will you handle issues you identified?
- In order to handle issues related to biases and privacy in our datasets, we will emphasize that the trends we seek to predict and their impacts apply primarily to the states where this data is collected. Additionally, it may be hard to address issues of racial representation in the dataset, or biases towards specific bodies in relation to clothes trends. Although these issues may arise, our best course of action is to research and incorporate ethically sourced data. Our focus on choosing quality datasets from the start will ensure our analysis can be ethically predictive of the population.
- For issues with locations, we want to make sure to prefaced out results with the locations we used in order to areas that are less affected by TikTok trends do not rely strongly on our results. 

# Team Expectations 

* Regular meetings/checkups every Tuesday at 7pm
* Use discord and messages as main form of communication
* If issues with communication arises, attempt to solve them as a team
  * If a team member is non-responsive after that, contact the professor
* Divide up responsibilities so that everyone can contribute
* Be open when you are struggling to complete a task


# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/31 | 11 AM  | N/A  | Previous Project Review | 
| 2/5  | 10 PM  | Previous Project Review | Potential Project Proposal topics | 
| 2/8  | 7  PM  | Google doc of potential topics and their relevant datasets | Choosing Project Proposal topic. Assigning roles to complete project proposals by deadline |
| 2/16 | 2  PM  | Project Proposal | How to approach web scraping and obtaining the datasets that we need. (tiktok api) Discuss assigning roles for the implementation of the data analysis |
| 2/21 | 5  PM  | Look into PyTrends API and google trend results from different Tiktok Trends. | Share information and create a sample dataset for 1 trend as the base for the rest of the trends. Assign each team member a role to complete Project Checkpoint #1. |
| 2/24 | 3  PM  | Prepared clean data<br> Project Checkpoint #1: <ul><li>revise research question/report</li><li>specify the datasets we will use and why</li><li>explain our data cleaning process</li></ul> | Checkup on progress, make sure everyone is confident in their tasks |
| 3/5  | 7  PM  | Each team member's tasks  | Checkup on progress, make sure everyone is confident in their tasks |
| 3/12 | 7  PM  | The data prediction model is nearing conclusion, time to analyze findings | Outlining the conclusion of the project. Distributing tasks for the final project deliverable with relevant findings |