# Enter Name here:

# Assignment

This process paper is a time for you to discuss the process you took personally to get to the final project. Why did you choose the data you chose? How well did it work for your question? Did you need (or want) to change your question as you evaluated your data more? What provided you with particular difficulties or struggles? How did you overcome those challenges? What parts were the easy parts?

This paper should be a minimum of 300 words. It should be turned in via a .ipynb file, so you can demonstrate some of the issues with code, or you can use it to show things you thought were particularly interesting, but might not have found their way into the final project.

Think of this as a reflection on the process you went through to get to the final project. You should be familiar with reflections at this point in your Queens career.

Begin writing in the cell below.


## Process Paper

When I first started this project, I wanted to work with a dataset that felt relevant to my everyday life. The Netflix Movies and TV Shows dataset immediately caught my attention because Netflix is such a major part of entertainment now. I decided my main question would be whether Netflix has shifted toward adding more recently released content over the years — something I was genuinely curious about.

The dataset seemed perfect at first glance, but as I worked with it, I realized there were a lot of technical issues that I had to solve before I could even get to the analysis. For example, one of the first problems was converting the 'date_added' column. Originally, I just tried:

```python
df['year_added'] = df['date_added'].year
```

which raised an AttributeError, because a 'Series' object has no attribute 'year'.  
After researching and looking through the dataset again, I realized I had to convert the column to datetime format first and then use the `.dt` accessor to get the year:

```python
import pandas as pd
# errors='coerce' turns unparseable dates into NaT instead of raising an error
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['year_added'] = df['date_added'].dt.year
```
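To make the difference concrete, here is a minimal sketch on a made-up three-row frame. The sample values and the name `sample` are my own assumptions for illustration, not rows from the Netflix file.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns.
sample = pd.DataFrame(
    {"date_added": ["September 25, 2021", "January 15, 2020", None]}
)

# A Series has no .year attribute, so the first attempt raises AttributeError.
try:
    sample["date_added"].year
    raised = False
except AttributeError:
    raised = True

# Parse to datetime first; missing or unparseable values become NaT.
sample["date_added"] = pd.to_datetime(sample["date_added"], errors="coerce")

# The .dt accessor exposes datetime parts element by element.
sample["year_added"] = sample["date_added"].dt.year

print(raised)
print(sample["year_added"])
```

Because one row is missing, `year_added` comes back as a float column with a NaN in it, which is part of what caused trouble later in the correlation step.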

Another serious issue came when I tried to run the Pearson correlation. Initially, I wrote:

```python
r, p = pearsonr(df['release_year'], df['year_added'])
```

but it threw an error because both columns had NaN values. Dropping them was only part of the fix: missing values also push an integer column to float64, so the two columns could end up with mismatched datatypes. I had to specifically **drop NaNs first**, **force the columns into numeric format**, and **verify their datatypes**:

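```python
# Drop rows missing either year, then force both columns to numeric
pearson_df = df[['release_year', 'year_added']].dropna()
pearson_df['release_year'] = pd.to_numeric(pearson_df['release_year'], errors='coerce')
pearson_df['year_added'] = pd.to_numeric(pearson_df['year_added'], errors='coerce')
print(pearson_df.dtypes)  # verify both columns are numeric
r, p = pearsonr(pearson_df['release_year'], pearson_df['year_added'])
```

After that, the Pearson correlation ran correctly and the scatterplot showed me a real trend.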
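The same cleaning steps can be tried on a tiny made-up frame. This is only a sketch: the values and the names `toy` and `clean` are assumptions for illustration, and the resulting correlation says nothing about the real Netflix data.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical values chosen only to illustrate the cleaning steps.
toy = pd.DataFrame({
    "release_year": [2010, 2015, 2018, 2020, 2021],
    "year_added": [2016.0, 2017.0, np.nan, 2020.0, 2021.0],
})

# Keep only rows where both columns are present.
clean = toy[["release_year", "year_added"]].dropna()

# Confirm that both columns are numeric before running the test.
print(clean.dtypes)

# pearsonr returns the correlation coefficient and the p-value.
r, p = pearsonr(clean["release_year"], clean["year_added"])
print(len(clean), r, p)
```

Dropping the rows from both columns at once matters, because `pearsonr` needs the two inputs to have the same length and to line up row by row.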

Another difficult part was dealing with the 'country' field for the international trends analysis. Countries were listed as comma-separated strings, so a single show could be labeled under multiple countries. I had to split each string into a list, explode the lists so that every country got its own row, and filter only the top five countries dynamically using this:

```python
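# .copy() avoids a SettingWithCopyWarning when the column is overwritten below
country_df = df.dropna(subset=['country', 'year_added']).copy()
# Turn "A, B" into ['A', 'B'], then give each country its own row
country_df['country'] = country_df['country'].str.split(', ')
country_expanded = country_df.explode('country')
# Keep only the five most frequent countries
top_countries = country_expanded['country'].value_counts().head(5).index
filtered = country_expanded[country_expanded['country'].isin(top_countries)]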
```
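To show what the split and explode steps actually do to the rows, here is a small sketch with three invented titles. The titles, countries, and variable names are hypothetical and only there to make the reshaping visible.

```python
import pandas as pd

# Hypothetical rows; "Show A" is co-produced by two countries.
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "country": ["United States, India", "India", None],
})

expanded = (
    toy.dropna(subset=["country"])                               # drop rows with no country
       .assign(country=lambda d: d["country"].str.split(", "))   # string -> list of countries
       .explode("country")                                       # one row per (title, country)
)

# Count how many titles each country appears in.
counts = expanded["country"].value_counts()

print(expanded)
print(counts)
```

Note that "Show A" now appears twice, once per country, so counts taken after exploding are counts of title and country pairs and not counts of unique titles.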

This project definitely gave me a run for my money. I had to restructure the DataFrame entirely to get meaningful graphs out of it.

Looking back, the hardest parts were definitely the hidden issues with data types, missing values, and reshaping data frames properly for analysis. The easier parts were creating the visualizations once the data was clean — Seaborn made it simple to create clean scatterplots, boxplots, and bar charts.

I never ended up changing my research question because the dataset supported it so well. However, I did expand the scope by adding bonus analyses on content types and international expansion once I realized the dataset had more to offer.

Overall, this project made me realize that getting the data ready is often harder than the actual analysis. Cleaning, validating, and reshaping the data took way more time than running correlations or t-tests. I feel like I learned a lot more about real-world data challenges, and not just “running code,” but actually understanding why something breaks and how to fix it properly.
