# Assignment #2: Data Exploration

In this assignment, you will demonstrate your knowledge of the Python and Pandas skills we've learned so far. These include:

- Getting an overview of your data
- Extracting columns
- Removing duplicates from your data
- Creating a subset of your data by matching a string
- Getting a random sample of your data
- Sorting your data
- Computing basic summary statistics of a series, such as the mean and median

You will not need to formulate or answer a research question for this assignment. You simply need to demonstrate your ability to perform specific operations on the data set you have selected. In the next assignment, your midterm, you will use many of the same methods to build a narrative exploration of the data that answers a research question.

You may use the data set you imported in Assignment #1, or you may choose another data set. If you keep the same data set, you can reuse the code you wrote in Assignment #1 to import it.

In the notebook below, I will ask you to perform a series of tasks. Each task will be followed by one or more blank code cells. Use as many cells as you need to complete the task. If you know another way to complete the task with Python and/or Pandas that differs from what we learned in class, that is also acceptable. Feel free to use Google searches to refresh your memory on how to perform each task. You will also likely want to consult the class notebooks, which can be found in [our class repository on GitHub](https://github.com/sha256rma/foundations-of-data-science).

If your data set does not have a needed form of data for a task, write code that would perform the task if a column with that data existed, and provide an explanation as a comment or markdown cell.
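One way to handle a missing column is sketched below. The toy frame `movies` and the column name `budget` are hypothetical and used only for illustration; the idea is to write the code that would do the job and leave a short explanation where the column is absent.

```python
import pandas as pd

# Toy data standing in for a data set that has no "budget" column.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "vote_average": [7.1, 6.4, 8.0],
})

# "budget" is a hypothetical column name chosen for this example.
column = "budget"

if column in movies.columns:
    # This branch would run if the column existed.
    mean_value = movies[column].mean()
    print(f"Mean of {column}: {mean_value}")
else:
    # Explanation for the reader: the data set lacks this column,
    # so the calculation above is shown but not executed.
    mean_value = None
    print(f"No '{column}' column found; skipping the mean.")
```

Checking `column in movies.columns` first avoids a `KeyError` and keeps the notebook runnable from top to bottom.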

You may receive partial credit for some incomplete or incorrect answers. Please feel free to add comments about your thought process, which will make it more likely that you will receive partial credit.

The last cell is a bonus. You will not be penalized for not attempting or completing it.

-----

Import Pandas and read your data in as a data frame. Assign the data frame to a variable, such as `df`.

In [6]:
import pandas as pd

In [8]:
df = pd.read_csv('top_popular_movies.csv')

Make the data frame visible by placing the data frame variable in a cell by itself.

In [10]:
df.columns.tolist()

['title', 'release_date', 'popularity', 'vote_average']

What columns are in your data frame? Use a function or method that shows all the columns.

In [12]:
num_rows = df.shape[0]
print(f"Number of rows in the DataFrame: {num_rows}")

Number of rows in the DataFrame: 500


Write Python code to output the number of rows in your data frame.

In [14]:
titles = df['title']
titles

0              Avatar: The Way of Water
1      Winnie the Pooh: Blood and Honey
2                          Cocaine Bear
3                  John Wick: Chapter 4
4          Puss in Boots: The Last Wish
                     ...               
495                           The Flash
496                                Nope
497                         San Andreas
498                           I Believe
499                              Frozen
Name: title, Length: 500, dtype: object

Write code to output a column from your data frame as a series. (Extract a column.)

### Read the example below to **count** duplicates based on specific columns (e.g., 'Name' and 'Age')

Check duplicates based on 'Name' and 'Age' columns (modify the subset for your own columns):

`duplicates_subset = df.duplicated(subset=['Name', 'Age'])`

Count the number of duplicates in the subset:

`num_duplicates_subset = duplicates_subset.sum()`

Display the number of duplicate rows based on your column(s):

`print(f"\nNumber of duplicate rows based on 'Name' and 'Age': {num_duplicates_subset}")`
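The three snippets above can be combined into one runnable cell. This is a minimal sketch on a made-up frame called `people`; the names, ages, and cities are invented for illustration.

```python
import pandas as pd

# Invented example data: row 2 repeats the Name/Age pair of row 0.
people = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ana", "Cy", "Ben"],
    "Age": [30, 25, 30, 41, 26],
    "City": ["Lima", "Oslo", "Quito", "Rome", "Oslo"],
})

# Boolean series: True for each row whose Name/Age pair appeared earlier.
duplicates_subset = people.duplicated(subset=["Name", "Age"])

# True counts as 1, so the sum is the number of duplicate rows.
num_duplicates_subset = duplicates_subset.sum()

print(f"Number of duplicate rows based on 'Name' and 'Age': {num_duplicates_subset}")
```

Only the second "Ana, 30" row is flagged. The two "Ben" rows differ in `Age`, so neither counts as a duplicate for this subset.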


After reading the above, modify the code snippets to check for duplicate rows based on one or more columns in your own data frame.

In [17]:
duplicates_subset = df.duplicated(subset=['title'])

In [19]:
num_duplicates_subset = duplicates_subset.sum()

In [21]:
print(f"\nNumber of duplicate rows based on 'title': {num_duplicates_subset}")


Number of duplicate rows based on 'title': 3


If any duplicates exist within a column, remove them with the `.drop_duplicates()` method.

In [23]:
df_cleaned = df.drop_duplicates(subset=['title'])
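As a side note, `drop_duplicates` accepts a `keep` argument that controls which of the repeated rows survives. The sketch below uses an invented frame `films` to compare the three options.

```python
import pandas as pd

# Invented data: "Film A" appears twice with different popularity values.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film A", "Film C"],
    "popularity": [10.0, 20.0, 30.0, 40.0],
})

# Default (keep="first"): keep the first occurrence of each title.
keep_first = films.drop_duplicates(subset=["title"])

# keep="last": keep the last occurrence instead.
keep_last = films.drop_duplicates(subset=["title"], keep="last")

# keep=False: drop every row whose title is repeated.
drop_all = films.drop_duplicates(subset=["title"], keep=False)

print(len(films), len(keep_first), len(keep_last), len(drop_all))
```

The method returns a new data frame and leaves `films` unchanged, which is why the result is assigned to a new variable.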

Write Python code to output the number of rows in your data frame.

In [25]:
new_num_rows = df_cleaned.shape[0]

Compare the new value with the number of rows you output the first time above. Is it more, less, or the same?

print(f"Number of rows after removing duplicates: {new_num_rows}")

Create a subset of the data that matches a specific string in a column. That is, extract all rows of the original data frame that contain a specific string in one of the columns. Save the resulting data frame to a variable. (Add more cells if needed.)
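One possible approach uses `Series.str.contains` to build a boolean mask. The sketch below runs on an invented frame `films`; the search string "star" and the variable names are assumptions made for the example.

```python
import pandas as pd

# Invented data, including one missing title.
films = pd.DataFrame({
    "title": ["Star Quest", "The Last Star", "Moon Patrol", None],
    "vote_average": [7.0, 6.5, 5.8, 6.1],
})

# case=False ignores capitalization, na=False treats missing titles as
# non-matches, and regex=False searches for the literal text.
mask = films["title"].str.contains("star", case=False, na=False, regex=False)

# Keep only the rows where the mask is True.
star_films = films[mask]

print(star_films)
print(len(star_films))
```

Passing `na=False` matters when the column has missing values, because a mask that contains missing entries cannot be used to filter rows.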

In [28]:
print(f"Original number of rows: {num_rows}, Cleaned number of rows: {new_num_rows}")

Original number of rows: 500, Cleaned number of rows: 497


Output the length of the resulting data frame (the subset of your data for which the conditional was true).

In [36]:
print("First 10 rows:\n", df.head(10))

First 10 rows:
                                    title release_date  popularity  \
0               Avatar: The Way of Water   2022-12-14    6789.789   
1       Winnie the Pooh: Blood and Honey   2023-01-27    3258.540   
2                           Cocaine Bear   2023-02-22    2781.045   
3                   John Wick: Chapter 4   2023-03-22    2803.482   
4           Puss in Boots: The Last Wish   2022-12-07    1531.792   
5  Prizefighter: The Life of Jem Belcher   2022-06-30    1400.798   
6                   The Devil Conspiracy   2023-01-13    1280.108   
7                     Knock at the Cabin   2023-02-01    1192.041   
8         Black Panther: Wakanda Forever   2022-11-09    1203.240   
9                               Cazadora   2023-01-19    1084.456   

   vote_average  
0           7.7  
1           5.9  
2           6.6  
3           8.2  
4           8.3  
5           6.2  
6           6.4  
7           6.4  
8           7.3  
9           6.6  


Output the first ten rows of your data set, the last ten rows of your data set, and a random ten rows of your data set (a sample of your data).

In [38]:
print("\nLast 10 rows:\n", df.tail(10))


Last 10 rows:
                    title release_date  popularity  vote_average
490             Blowback   2022-06-17      95.007           6.0
491             Scream 3   2000-02-03      75.129           6.0
492         The Key Game   2022-04-13      95.436           5.9
493   Dungeons & Dragons   2000-12-08     101.399           4.3
494  Fifty Shades Darker   2017-02-08      84.395           6.5
495            The Flash   2023-06-14      85.359           0.0
496                 Nope   2022-07-20      86.922           6.9
497          San Andreas   2015-05-27      78.885           6.2
498            I Believe   2019-06-06      87.802           7.0
499               Frozen   2013-11-20      94.184           7.2


In [40]:
print("\nRandom 10 rows:\n", df.sample(10))


Random 10 rows:
                                        title release_date  popularity  \
339                                 Twilight   2008-11-20     105.373   
340                                      Dog   2022-02-17     117.505   
431                              Ratatouille   2007-06-28     102.363   
4               Puss in Boots: The Last Wish   2022-12-07    1531.792   
491                                 Scream 3   2000-02-03      75.129   
158              Lupin The 3rd vs. Cat’s Eye   2023-01-26     197.557   
164                           Naked Rashomon   1973-07-04     156.342   
278                          The Maze Runner   2014-09-10     121.824   
173  Harry Potter and the Chamber of Secrets   2002-11-13     169.272   
5      Prizefighter: The Life of Jem Belcher   2022-06-30    1400.798   

     vote_average  
339           6.3  
340           7.4  
431           7.8  
4             8.3  
491           6.0  
158           6.0  
164           6.7  
278           7.2 
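Because `sample` draws different rows on every run, it can help to fix the seed when you want the same sample each time. The sketch below assumes a made-up frame `numbers`; the seed value 42 is arbitrary.

```python
import pandas as pd

# Made-up data: one column holding the integers 0 through 99.
numbers = pd.DataFrame({"value": range(100)})

# Passing the same random_state yields the same rows on every run.
sample_a = numbers.sample(10, random_state=42)
sample_b = numbers.sample(10, random_state=42)

print(sample_a.equals(sample_b))
```

Without `random_state`, rerunning the cell would generally show a different set of ten rows.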

## Bonus

You may want to refer to the [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

Sort your data frame by a particular column and output the result.
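A minimal sketch of sorting with `sort_values`, using an invented frame `films` and sorting on its `popularity` column:

```python
import pandas as pd

# Invented data for illustration.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "popularity": [95.0, 1531.8, 120.5],
})

# ascending=False puts the largest value first.
by_popularity = films.sort_values(by="popularity", ascending=False)

print(by_popularity)
```

`sort_values` returns a sorted copy, so `films` keeps its original row order unless you reassign the result.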

Pick a column in your data set with numeric data (integers or floats). Output the mean (average) and median of that column.
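The sketch below computes both statistics on an invented series of ratings. The values are made up, and the single 0.0 is included to show how one extreme value affects the two measures differently.

```python
import pandas as pd

# Invented ratings; the 0.0 acts as an outlier.
votes = pd.Series([6.0, 7.0, 8.0, 0.0, 9.0], name="vote_average")

mean_vote = votes.mean()      # sum of the values divided by their count
median_vote = votes.median()  # middle value once the data is sorted

print(f"Mean: {mean_vote}, Median: {median_vote}")
```

Here the mean is 6.0 and the median is 7.0. The outlier pulls the mean down while the median barely moves, which is a reason to report both.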