# Activity: Structure your data 

## Introduction

In this activity, you will practice structuring, an **exploratory data analysis (EDA)** step that helps data science projects move forward. During EDA, when working with data that contains aspects of date and time, "datetime" transformations are integral to better understanding the data. As a data professional, you will encounter datetime transformations quite often as you determine how to format your data to suit the problems you want to solve or the questions you want to answer. This activity gives you an opportunity to apply these skills and prepares you for future EDA, where you will need to determine how best to structure your data.

In this activity, you are a member of an analytics team that provides insights to an investing firm. To help them decide which companies to invest in next, the firm wants insights into **unicorn companies**: companies that are valued at over one billion dollars.

You will work with a dataset about unicorn companies, discovering characteristics of the data, structuring the data in ways that will help you draw meaningful insights, and using visualizations to analyze the data. Ultimately, you will draw conclusions about what significant trends or patterns you find in the dataset. This will develop your skills in EDA and your knowledge of functions that allow you to structure data.





## Step 1: Imports 

### Import relevant libraries and modules

Import the relevant Python libraries and modules that you will need to use. In this activity, you will use `pandas`, `numpy`, `seaborn`, and `matplotlib.pyplot`.

In [None]:
# Import the relevant Python libraries and modules needed in this lab.

### YOUR CODE HERE ###


### Load the dataset into a DataFrame

The dataset provided is a CSV file named `Unicorn_Companies.csv` and contains a subset of data on unicorn companies. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file or write any additional code to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# RUN THIS CELL TO IMPORT YOUR DATA.

companies = pd.read_csv("Unicorn_Companies.csv")

## Step 2: Data exploration


### Display the first 10 rows of the data

In this section, you will discover what the dataset entails and answer questions to guide your exploration and analysis of the data. This is an important step in EDA. 

To begin, display the first 10 rows of the data to get an understanding of how the dataset is structured. 
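
As a rough illustration of previewing rows, the sketch below calls `DataFrame.head` on a small made-up DataFrame. The name `toy` and its contents are hypothetical and stand in for the lab's data.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({
    "Company": [f"Company {i}" for i in range(1, 13)],
    "Year Founded": [2010 + i % 5 for i in range(1, 13)],
})

# head(n) returns the first n rows; the default is 5 when n is omitted.
first_ten = toy.head(10)
print(first_ten)
```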

In [None]:
# Display the first 10 rows of the data.

### YOUR CODE HERE ###



### Identify the number of rows and columns

Identify the number of rows and columns in the dataset. This will help you get a sense of how much data you are working with.
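
One common way to get these numbers is the `shape` attribute, which holds a `(rows, columns)` tuple. The sketch below uses a hypothetical three-row DataFrame named `toy`.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Industry": ["Fintech", "Health", "AI"],
    "Year Founded": [2012, 2015, 2018],
})

# shape is an attribute (no parentheses) holding (number of rows, number of columns).
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```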

In [None]:
# Identify the number of rows and columns in the dataset.

### YOUR CODE HERE ###



**Question:** How many rows and columns are in the dataset? How many unicorn companies are there? How many aspects are shown for each company?


[Write your response here. Double-click (or enter) to edit.]

### Check for duplicates in the data
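
A possible approach, sketched on made-up data: `DataFrame.duplicated` flags rows that repeat an earlier row, and `DataFrame.drop_duplicates` returns a copy without them. The name `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one fully repeated row.
toy = pd.DataFrame({
    "Company": ["A", "B", "B", "C"],
    "Year Founded": [2012, 2015, 2015, 2018],
})

# duplicated() marks a row True when it matches an earlier row in every column.
duplicate_mask = toy.duplicated()
n_duplicates = int(duplicate_mask.sum())
print(f"Duplicate rows: {n_duplicates}")

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()
print(deduplicated.shape)
```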


In [None]:
# Check for duplicates.

### YOUR CODE HERE ###


**Question:** Based on the preceding output, are there any duplicates in the dataset?


[Write your response here. Double-click (or enter) to edit.]

### Display the data types of the columns 

Knowing the data types of the columns is helpful because it indicates what types of analysis and aggregation can be done, how a column can be transformed to suit specific tasks, and so on. Display the data types of the columns. 
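
For instance, the `dtypes` attribute lists one type per column. In the hypothetical DataFrame below, the date column is stored as text, which is why date operations would not yet work on it.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the lab, the values are made up.
toy = pd.DataFrame({
    "Company": ["A", "B"],
    "Valuation": ["$1B", "$4B"],
    "Date Joined": ["2017-04-07", "2021-12-01"],
    "Year Founded": [2012, 2015],
})

# Text columns show up as object or string (depending on the pandas version),
# while whole numbers show up as an integer type.
print(toy.dtypes)

is_datetime = pd.api.types.is_datetime64_any_dtype(toy["Date Joined"])
print(f"Date Joined is datetime: {is_datetime}")
```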

In [None]:
# Display the data types of the columns.

### YOUR CODE HERE ###



**Question:** What do you notice about the data types of the columns in the dataset?


[Write your response here. Double-click (or enter) to edit.]

**Question:** How would you sort this dataset in order to get insights about when the companies were founded? Then, how would you arrange the data from companies that were founded the earliest to companies that were founded the latest?


[Write your response here. Double-click (or enter) to edit.]

### Sort the data

In this section, you will continue your exploratory data analysis by structuring the data. This is an important step in EDA, as it allows you to glean valuable and interesting insights about the data afterwards.

To begin, sort the data so that you can get insights about when the companies were founded. Consider whether it would make sense to sort in ascending or descending order based on what you would like to find.
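
The difference between the two orders can be sketched with `DataFrame.sort_values` on made-up data. The names `toy`, `oldest_first`, and `newest_first` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Year Founded": [2015, 1999, 2020],
})

# ascending=True puts the earliest founding year first.
oldest_first = toy.sort_values(by="Year Founded", ascending=True)

# ascending=False puts the most recent founding year first.
newest_first = toy.sort_values(by="Year Founded", ascending=False)

print(oldest_first)
print(newest_first)
```

`sort_values` returns a new DataFrame and leaves the original unchanged, so assign the result if you want to keep it.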

In [None]:
# Sort `companies` and display the first 10 rows of the resulting DataFrame.

### YOUR CODE HERE ###




**Question:** What do you observe from the sorting that you performed?


[Write your response here. Double-click (or enter) to edit.]

**Question:** Which library would you use to get the count of each distinct value in the `Year Founded` column? 


[Write your response here. Double-click (or enter) to edit.]

### Determine the number of companies founded each year

Find out how many companies in this dataset were founded each year. Make sure to display each unique `Year Founded` that occurs in the dataset, and for each year, a number that represents how many companies were founded then.
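
One way to count occurrences is `Series.value_counts`, shown here on a made-up column. By default the result is ordered from the most frequent value to the least frequent; `sort_index` reorders it by year instead.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({"Year Founded": [2015, 2015, 2012, 2020, 2015, 2012]})

# Counts ordered by frequency (most common year first).
counts_by_frequency = toy["Year Founded"].value_counts()

# The same counts ordered chronologically.
counts_by_year = counts_by_frequency.sort_index()

print(counts_by_frequency)
print(counts_by_year)
```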

In [None]:
# Display each unique year that occurs in the dataset
# along with the number of companies that were founded in each unique year.

### YOUR CODE HERE ###

**Question:** What do you observe from the counts of the unique `Year Founded` values in the dataset?


[Write your response here. Double-click (or enter) to edit.]

**Question:** What kind of graph represents the counts of samples based on a particular feature?


[Write your response here. Double-click (or enter) to edit.]

Plot a histogram of the `Year Founded` feature.
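
A minimal sketch of a histogram on made-up data follows. It uses `matplotlib` directly to keep the example small; the lab imports `seaborn` as well, and `sns.histplot` is a likely alternative. The bin count of 5 is an arbitrary choice for the toy data.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so the sketch runs anywhere; omit in a notebook.
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({"Year Founded": [2010, 2012, 2012, 2015, 2015, 2015, 2019, 2020]})

fig, ax = plt.subplots(figsize=(6, 4))

# hist() returns the count in each bin, the bin edges, and the drawn bars.
bin_counts, bin_edges, _ = ax.hist(toy["Year Founded"], bins=5)

ax.set_xlabel("Year Founded")
ax.set_ylabel("Number of companies")
ax.set_title("Toy histogram of Year Founded")
```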

In [None]:
# Plot a histogram of the Year Founded feature.
### YOUR CODE HERE ###

**Question:** If you want to compare when one company joined unicorn status to when another company joined, how would you transform the `Date Joined` column to gain that insight? To answer this question, notice the data types.


[Write your response here. Double-click (or enter) to edit.]

### Convert the `Date Joined` column to datetime

Convert the `Date Joined` column to datetime. This gives you access to the year, month, and day components of each value, allowing you to later gain insights about when a company gained unicorn status with respect to each component.
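
As a sketch of the conversion, `pd.to_datetime` parses a text column into datetime values, after which the `.dt` accessor exposes individual components. The date strings below are made up, and their `YYYY-MM-DD` format is an assumption; the real file may use a different format.

```python
import pandas as pd

# Hypothetical toy data; the date format is an assumption.
toy = pd.DataFrame({
    "Company": ["A", "B"],
    "Date Joined": ["2017-04-07", "2021-12-01"],
})

# Overwrite the text column with parsed datetime values.
toy["Date Joined"] = pd.to_datetime(toy["Date Joined"])

# Confirm the new type, then read components through the .dt accessor.
print(toy.dtypes)
years = toy["Date Joined"].dt.year.tolist()
days = toy["Date Joined"].dt.day.tolist()
print(years, days)
```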

In [None]:
# Convert the `Date Joined` column to datetime.
# Update the column with the converted values.

### YOUR CODE HERE ###




# Display the data types of the columns in `companies`
# to confirm that the update actually took place.

### YOUR CODE HERE ###



**Question:** How would you obtain the names of the months when companies gained unicorn status?


[Write your response here. Double-click (or enter) to edit.]

### Create a `Month Joined` column

Obtain the names of the months when companies gained unicorn status, and use the result to create a `Month Joined` column. 
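
Once a column holds datetime values, `Series.dt.month_name()` returns month names as text. The sketch below applies it to a made-up DataFrame named `toy`.

```python
import pandas as pd

# Hypothetical toy data with an already-converted datetime column.
toy = pd.DataFrame({
    "Date Joined": pd.to_datetime(["2017-04-07", "2021-12-01"]),
})

# month_name() is a method, so it needs parentheses; names are in English by default.
toy["Month Joined"] = toy["Date Joined"].dt.month_name()
print(toy)
```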

In [None]:
# Obtain the names of the months when companies gained unicorn status.
# Use the result to create a `Month Joined` column.

### YOUR CODE HERE ###




# Display the first few rows of `companies`
# to confirm that the new column did get added.

### YOUR CODE HERE ###



**Question:** Using the `Date Joined` column, how would you determine how many years it took for companies to reach unicorn status?


[Write your response here. Double-click (or enter) to edit.]

### Create a `Years To Join` column

Determine how many years it took for companies to reach unicorn status, and use the result to create a `Years To Join` column. Adding this to the dataset can help you answer questions you may have about this aspect of the companies.
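
One simple way to approximate this duration is to subtract the founding year from the year component of the join date. The sketch uses made-up values and counts whole calendar years only, ignoring months and days.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({
    "Date Joined": pd.to_datetime(["2017-04-07", "2021-12-01"]),
    "Year Founded": [2012, 2015],
})

# Year of joining minus year of founding, computed row by row.
toy["Years To Join"] = toy["Date Joined"].dt.year - toy["Year Founded"]
print(toy)
```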

In [None]:
# Determine how many years it took for companies to reach unicorn status.
# Use the result to create a `Years To Join` column.

### YOUR CODE HERE ###




# Display the first few rows of `companies`
# to confirm that the new column did get added.

### YOUR CODE HERE ###



**Question:** Which year would you like to gain more insight on with respect to when companies attained unicorn status, and why?


[Write your response here. Double-click (or enter) to edit.]

### Gain more insight on a specific year

To gain more insight on the year that interests you, filter the dataset by that year and save the resulting subset into a new variable.
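
Filtering by year can be sketched with a Boolean mask built from the `.dt.year` component. The year 2021 and the names `toy` and `joined_in_year` are illustrative choices, not part of the lab.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Date Joined": pd.to_datetime(["2020-03-15", "2021-06-01", "2021-11-20"]),
})

year_of_interest = 2021  # Illustrative choice.

# The comparison yields True/False per row; indexing with it keeps the True rows.
# copy() makes the subset independent, so adding columns later does not warn.
joined_in_year = toy[toy["Date Joined"].dt.year == year_of_interest].copy()
print(joined_in_year)
```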

In [None]:
# Filter dataset by a year of your interest (in terms of when companies reached unicorn status).
# Save the resulting subset in a new variable. 

### YOUR CODE HERE ###




# Display the first few rows of the subset to confirm that it was created.

### YOUR CODE HERE ###



**Question:** Using a time interval, how could you observe trends in the companies that became unicorns in one year?


[Write your response here. Double-click (or enter) to edit.]

### Observe trends over time

Implement the structuring approach that you have identified to observe trends over time in the companies that became unicorns for the year that interests you.
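
The two-step pattern of labeling each row with an interval and then counting per interval might look like the following sketch. It assumes a monthly interval and made-up data; a weekly or quarterly label would follow the same shape. The names `subset` and `companies_per_month` are hypothetical.

```python
import pandas as pd

# Hypothetical subset of companies that joined in one year.
subset = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Date Joined": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-11", "2021-04-02", "2021-04-30"]
    ),
})

# Step 1: label each row with its interval (here, year and month as text).
subset["Month Joined"] = subset["Date Joined"].dt.strftime("%Y-%m")

# Step 2: group by the interval and count the companies in each group.
companies_per_month = (
    subset.groupby("Month Joined")["Company"]
    .count()
    .reset_index(name="Company Count")
)
print(companies_per_month)
```

Intervals with no companies (March in this toy data) do not appear in the result, which is worth keeping in mind when reading a plot built from it.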

In [None]:
# After identifying the time interval that interests you, proceed with the following:
# Step 1. Take the subset that you defined for the year of interest. 
#         Insert a column that contains the time interval that each data point belongs to, as needed.
# Step 2. Group by the time interval.
#         Aggregate by counting companies that joined per interval of that year.
#         Save the resulting DataFrame in a new variable.

### YOUR CODE HERE ###





# Display the first few rows of the new DataFrame to confirm that it was created

### YOUR CODE HERE ###




**Question:** How would you structure the data to observe trends in the average valuation of companies from 2020 to 2021?  

[Write your response here. Double-click (or enter) to edit.]

### Compare trends over time

Implement the structuring approach that you have identified in order to compare trends over time in the average valuation of companies that became unicorns in the year you selected above and in another year of your choice. Keep in mind the data type of the `Valuation` column and what the values in that column contain currently.
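
The sketch below walks through one possible version of this workflow on made-up data. It assumes valuations are stored as text such as `$4B` (meaning billions of dollars) and uses quarters as the time interval; both are assumptions to check against the actual column. The names `subset_2020`, `subset_2021`, and `avg_valuation` are hypothetical.

```python
import pandas as pd

# Hypothetical subsets for two years; the "$<number>B" format is an assumption.
subset_2020 = pd.DataFrame({
    "Date Joined": pd.to_datetime(["2020-02-10", "2020-08-03"]),
    "Valuation": ["$2B", "$4B"],
})
subset_2021 = pd.DataFrame({
    "Date Joined": pd.to_datetime(["2021-01-15", "2021-03-01", "2021-09-09"]),
    "Valuation": ["$1B", "$3B", "$10B"],
})

# Stack the two subsets into one DataFrame with a fresh index.
combined = pd.concat([subset_2020, subset_2021], ignore_index=True)

# Label each row with its quarter, for example "2021Q1".
combined["Quarter Joined"] = combined["Date Joined"].dt.to_period("Q").astype(str)

# Strip the "$" and "B" characters from both ends, then convert to a number.
combined["Valuation (billions)"] = combined["Valuation"].str.strip("$B").astype(float)

# Average valuation per quarter.
avg_valuation = (
    combined.groupby("Quarter Joined")["Valuation (billions)"]
    .mean()
    .reset_index(name="Average Valuation")
)
print(avg_valuation)
```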

In [None]:
# After identifying the additional year and time interval of interest, proceed with the following:
# Step 1. Filter by the additional year to create a subset that consists of companies that joined in that year.
# Step 2. Concatenate that new subset with the subset that you defined previously.
# Step 3. As needed, add a column that contains the time interval that each data point belongs to, 
#         in the concatenated DataFrame.
# Step 4. Transform the `Valuation` column as needed.
# Step 5. Group by the time interval.
#         Aggregate by computing average valuation of companies that joined per interval of the corresponding year.
#         Save the resulting DataFrame in a new variable.

### YOUR CODE HERE ###



# Display the first few rows of the new DataFrame to confirm that it was created.

### YOUR CODE HERE ###




## Step 3: Time-to-unicorn visualization

### Visualize the time it took companies to become unicorns

Using the `companies` dataset, create a box plot to visualize the distribution of how long it took companies to become unicorns, with respect to the month they joined. 
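
Month names sort alphabetically by default, so an explicit chronological list is needed to order the axis. The sketch below shows the idea on made-up data using `matplotlib` only; with `seaborn`, passing the list to the `order` parameter of `sns.boxplot` is a likely equivalent. The name `toy` and its values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so the sketch runs anywhere; omit in a notebook.
import matplotlib.pyplot as plt
import pandas as pd

# Months in chronological order.
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({
    "Month Joined": ["March", "January", "March", "January", "December"],
    "Years To Join": [4, 6, 8, 2, 5],
})

# Keep only months that occur in the data, in chronological order.
months_in_data = set(toy["Month Joined"])
present_months = [m for m in month_order if m in months_in_data]

# One array of values per month, in the same order as the labels.
data_by_month = [
    toy.loc[toy["Month Joined"] == m, "Years To Join"].to_numpy()
    for m in present_months
]

fig, ax = plt.subplots(figsize=(6, 4))
boxes = ax.boxplot(data_by_month)

# Boxes are drawn at positions 1..n; label them and rotate to avoid overlap.
ax.set_xticks(range(1, len(present_months) + 1))
ax.set_xticklabels(present_months, rotation=45)
ax.set_title("Toy distribution of Years To Join by month")
```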

In [None]:
# Define a list that contains months in chronological order.

### YOUR CODE HERE ###


# Print out the list to confirm it is correct.

### YOUR CODE HERE ###


            

In [None]:
# Create the box plot to visualize the distribution of how long it took companies to become unicorns, with respect to the month they joined.
# Make sure the x-axis goes in chronological order by month, using the list you defined previously.
# Plot the data from the `companies` DataFrame.

### YOUR CODE HERE ###



# Set the title of the plot.

### YOUR CODE HERE ###



# Rotate labels on the x-axis as a way to avoid overlap in the positions of the text.  

### YOUR CODE HERE ###



# Display the plot.

### YOUR CODE HERE ###




**Question:** In the preceding box plot, what do you observe about the median value for `Years To Join` for each month?


[Write your response here. Double-click (or enter) to edit.]

## Step 4: Results and evaluation


### Visualize the time it took companies to reach unicorn status

In this section, you will evaluate the result of structuring the data, making observations, and gaining further insights about the data. 

Using the `companies` dataset, create a bar plot to visualize the average number of years it took companies to reach unicorn status with respect to when they were founded. 
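
A bar plot of averages can be sketched by computing the group means first and then plotting them. The example uses made-up data and `matplotlib` only; `sns.barplot` would typically compute the mean per category on its own. The names `toy` and `mean_years` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so the sketch runs anywhere; omit in a notebook.
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
toy = pd.DataFrame({
    "Year Founded": [2010, 2010, 2015, 2015, 2019],
    "Years To Join": [8, 10, 4, 6, 2],
})

# Average number of years to join for each founding year.
mean_years = toy.groupby("Year Founded")["Years To Join"].mean()

fig, ax = plt.subplots(figsize=(8, 4))

# Convert years to text so each year becomes its own evenly spaced category.
ax.bar([str(year) for year in mean_years.index], mean_years.to_numpy())

ax.set_title("Toy average years to reach unicorn status")
ax.set_xlabel("Year Founded")
ax.set_ylabel("Average Years To Join")
ax.tick_params(axis="x", rotation=45)
```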

In [None]:
# Set the size of the plot.

### YOUR CODE HERE ###




# Create bar plot to visualize the average number of years it took companies to reach unicorn status 
# with respect to when they were founded.
# Plot data from the `companies` DataFrame.

### YOUR CODE HERE ###




# Set title

### YOUR CODE HERE ###




# Set x-axis label

### YOUR CODE HERE ###




# Set y-axis label

### YOUR CODE HERE ###




# Rotate the labels on the x-axis as a way to avoid overlap in the positions of the text.  

### YOUR CODE HERE ###



# Display the plot.

### YOUR CODE HERE ###



**Question:** What trends do you notice in the data? Specifically, consider companies that were founded later on. How long did it take those companies to reach unicorn status?


[Write your response here. Double-click (or enter) to edit.]

### Visualize the number of companies that joined per interval 

Using the subset of companies joined in the year of interest, grouped by the time interval of your choice, create a bar plot to visualize the number of companies that joined per interval for that year. 

In [None]:
# Set the size of the plot.

### YOUR CODE HERE ###



# Create bar plot to visualize number of companies that joined per interval for the year of interest.

### YOUR CODE HERE ###



# Set the x-axis label.

### YOUR CODE HERE ###



# Set the y-axis label.

### YOUR CODE HERE ###



# Set the title.

### YOUR CODE HERE ###



# Rotate labels on the x-axis as a way to avoid overlap in the positions of the text.  

### YOUR CODE HERE ###



# Display the plot.

### YOUR CODE HERE ###



**Question:** What do you observe from the bar plot of the number of companies that joined per interval for the year of 2021? When did the highest number of companies reach $1 billion valuation?

  

[Write your response here. Double-click (or enter) to edit.]

### Visualize the average valuation over the quarters

Using the subset of companies that joined in the years of interest, create a grouped bar plot to visualize the average valuation over the quarters, with two bars for each time interval. This allows you to compare quarterly values between the two years.
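
One way to build a grouped bar plot is to slice the interval label into a year part and a quarter part, pivot so that each year becomes a column, and plot the result. The sketch assumes labels shaped like `2021Q1` and uses made-up averages; `sns.barplot` with a `hue` argument is a likely alternative. The names `avg` and `wide` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so the sketch runs anywhere; omit in a notebook.
import pandas as pd

# Hypothetical averages per quarter; the "YYYYQn" label format is an assumption.
avg = pd.DataFrame({
    "Quarter Joined": ["2020Q1", "2020Q3", "2021Q1", "2021Q3"],
    "Average Valuation": [2.0, 4.0, 2.0, 10.0],
})

# String slicing: the first four characters are the year, the last two the quarter.
avg["Year Joined"] = avg["Quarter Joined"].str[:4]
avg["Quarter Number"] = avg["Quarter Joined"].str[-2:]

# One row per quarter, one column per year.
wide = avg.pivot(index="Quarter Number", columns="Year Joined", values="Average Valuation")

# Each row becomes a group of bars, with one bar per column (year).
ax = wide.plot.bar(figsize=(8, 4), rot=0)
ax.set_xlabel("Quarter")
ax.set_ylabel("Average valuation (billions of dollars)")
ax.set_title("Toy average valuation per quarter, by year")
```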

In [None]:
# Using slicing, extract the year component and the time interval that you specified, 
# and save them by adding two new columns into the subset. 

### YOUR CODE HERE ###



# Set the size of the plot.

### YOUR CODE HERE ###



# Create a grouped bar plot.

### YOUR CODE HERE ###



# Set the x-axis label.

### YOUR CODE HERE ###



# Set the y-axis label.

### YOUR CODE HERE ###



# Set the title.

### YOUR CODE HERE ###



# Display the plot.

### YOUR CODE HERE ###



**Question:** What do you observe from the preceding grouped bar plot?

  

[Write your response here. Double-click (or enter) to edit.]

**Question:** Is there any bias in the data that could potentially inform your analysis?


[Write your response here. Double-click (or enter) to edit.]

**Question:** What potential next steps could you take with your EDA?

[Write your response here. Double-click (or enter) to edit.]

**Question:** Are there any unanswered questions you have about the data? If yes, what are they?


[Write your response here. Double-click (or enter) to edit.]

## Considerations

**What are some key takeaways that you learned from this lab?**

[Write your response here. Double-click (or enter) to edit.]

**What findings would you share with others?**

[Write your response here. Double-click (or enter) to edit.]

**What recommendations would you share with stakeholders based on these findings?**

[Write your response here. Double-click (or enter) to edit.]

**References**

Bhat, M.A. (2022, March). [*Unicorn Companies*](https://www.kaggle.com/datasets/mysarahmadbhat/unicorn-companies).