<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST3512/blob/main/AnalysisReportTemplate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Analysis Report Template    


**A shared resource for the organization of data science analysis and report presentation.**

Prepared by CUNY CityTech CST3512 class, Spring 2022.   



**OUTLINE**    

* Executive Summary   
* Background   
* Problem Statement   
* Approach    
* Findings    
* Model/Analysis Design   
* One or More Themes (deeper findings)    
* Implications (so what?)     
* Recommendation(s)   
* Next Steps (implement/additional-research)   


* APPENDICES    
1.   Footnotes    
2.   Glossary of terms    
3.   Additional reading   
4.   Resources - Data sources to consider   
5.   Data discovery approach   


##Executive Summary    



##Background    



##Problem Statement    



##Approach    



##Findings    


##Model/Analysis Design    



##Themes    



##Implications    



##Recommendations    


##Next Steps    




---



##Appendix 1 - Footnotes    




---



##Appendix 2 - Glossary of Terms    




---



##Appendix 3 - Additional Reading    




---



##Appendix 4 - RESOURCES - Data Sources to Consider   

* **Kaggle** - Data sources across disciplines, time periods, and geographies
* **OWID** - Our World in Data
* **United Nations** -- Many global datasets
* **Project Gutenberg** - Open-source text of works of literature
* **Data.gov** - A variety of sources of data on government measures    
* **U.S. Bureau of Labor Statistics (BLS)** -- Labor and employment data in the United States    
* **NYC Open Data** - Multiple data sources from New York City Government
* **FBI Uniform Crime Reporting (UCR)** -- Aggregated law enforcement data from across the United States
* **Yahoo Finance** - Time-series market data and a variety of data sources about public companies
* **EDGAR** - Securities and Exchange Commission (SEC) data of corporate filings
* **Health Data** - Various data sources
* **Weather** - Various data sources
* **Geolocation** - Data sources on Named Entity : Longitude/Latitude maps, and other location-specific data   


*note: **Google** curates a [catalog of open data sources](https://www.google.com/publicdata/directory)*


###Kaggle     

The following dataset contains Netflix's stock prices from February 5th, 2018 to February 5th, 2022.

https://www.kaggle.com/datasets/jainilcoder/netflix-stock-price-prediction

###OWID     

place OWID info and links in this section

###United Nations    

place UN info and links here    


###Project Gutenberg     

place Project Gutenberg info and links in this section   


###Data.gov    

place data.gov info and links in this section    



###U.S. Bureau of Labor Statistics (BLS)    

place BLS info here.    

*note: more online at https://www.bls.gov/*

### NYC Open Data    

place New York City Open Data info and links in this section    


In [None]:
# Data.gov portal: https://data.gov

###FBI Uniform Crime Reporting (UCR)    

Aggregate data from law enforcement across the United States at: https://www.fbi.gov/services/cjis/ucr    


### Yahoo Finance    

place Yahoo Finance info and links here    



###EDGAR (U.S. SEC)    

place SEC Edgar info and links in this section    


###Weather 

1.    [OpenWeatherMap](https://openweathermap.org/weathermap) - Dynamic weather maps.    

2.    [NOAA's Geoplatform](https://www.climate.gov/maps-data/all) - Data, maps, and analytics on weather from the U.S. National Oceanic and Atmospheric Administration (NOAA).     


###Google Catalog of Open Data    

place Google Catalog of Open Data info and links here.

See `google.com/publicdata/directory` for the home page.

###Data.World    

[Data.World](https://data.world) is a collection of available data sources for research and exercises.    


##Appendix 5 - Data Discovery Approach

Each problem and the data sources considered will raise unique considerations but there are several typical steps to data extraction, discovery, and transformation which are helpful to consider.    

The following is derived from the work of Anmol Tomar in CodeX, entitled "[Every Data Analysis in 10 steps!  Adding structure to your data analysis!](https://medium.com/codex/every-data-analysis-in-10-steps-960dc7e7f00b)"

###File Upload

**IMDB Dataset**    

For illustration purposes, this analysis uses the Kaggle IMDB dataset for the top 1000 movies to understand the features/traits of top IMDB movies by applying a 10-step process.    *The Kaggle file used in this notebook differs from the source file used by Anmol Tomar in the article referenced above.*  

A copy of the file is hosted on Professor Patrick's GitHub and can be accessed with `!curl` to download a copy to the `content` folder in Colab.    


In [None]:
!curl "https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/data/main/IMDB_top_1000.csv" -o imdb_top_1000.csv

###Import Packages    

The load, discovery, and transformation steps will only require that the `pandas` and `matplotlib` packages be imported.    

To add functionality to data tables in Colab, import `data_table` from `google.colab`.

In [2]:
# import our packages 
import pandas as pd 
import matplotlib.pyplot as plt

from google.colab import data_table
data_table.enable_dataframe_formatter()

###Read the Data into a Dataframe    


In [21]:
# reading the data
df_movies =  pd.read_csv('imdb_top_1000.csv')

###1. Summarize the Columns    



In [None]:
# summary of the columns   
df_movies.describe(include = 'all')

The full dataframe can also be displayed as an interactive table to peruse the data.    

In [None]:
# query the detail in the dataframe
df_movies

###2. Data Types    

The next step is to do a sanity check of the data types of the columns of the dataframe. If there are some incorrect data types they can be corrected in this step.

In [None]:
# check the datatypes 
print(df_movies.dtypes)
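
One way to correct a column that was read as text is sketched below. The small `sample` dataframe and its values are hypothetical stand-ins, not rows from the IMDB file. `pd.to_numeric` with `errors="coerce"` converts what it can and marks the rest as missing, which makes bad entries easy to count.

```python
import pandas as pd

# Hypothetical values: a numeric column that arrived as strings
sample = pd.DataFrame({"Metascore": ["80", "91", "n/a", "74"]})

# errors="coerce" turns unparseable entries into NaN instead of raising
sample["Metascore"] = pd.to_numeric(sample["Metascore"], errors="coerce")

print(sample.dtypes)
# Number of entries that could not be converted
print(sample["Metascore"].isna().sum())
```

Coercing rather than raising keeps the load step running, but the resulting nulls still need to be reviewed before any treatment is applied.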

The field `Duration` is a string (object data type) with the runtime in minutes followed by the term 'min'.  A new integer column, `Runtime`, can be generated by splitting `Duration` on the space and converting the first element of the result to an integer. 

In [32]:
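# Extract the numeric value of movie Duration to a new column Runtime 
# n=1 is passed by keyword; recent pandas versions no longer accept it positionally
df_movies[['Runtime', 'Unit']] = df_movies["Duration"].str.split(' ', n=1, expand=True)
df_movies['Runtime'] = df_movies['Runtime'].astype(int)
# The helper column holding the text 'min' is no longer needed
del df_movies['Unit']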
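
The same extraction can be sketched with a regular expression, which avoids the temporary `Unit` column. The `sample` dataframe below is a hypothetical stand-in that assumes every `Duration` value starts with digits, as described above.

```python
import pandas as pd

# Hypothetical values in the "<minutes> min" layout
sample = pd.DataFrame({"Duration": ["142 min", "175 min", "96 min"]})

# Capture the leading digits; expand=False returns a Series rather than a DataFrame
sample["Runtime"] = (
    sample["Duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(sample)
```

If some rows did not match the pattern, `str.extract` would return missing values for them and the `astype(int)` call would fail, so this version suits columns that are known to be clean.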

In [None]:
# Query the dataframe after the transformation   
df_movies

In [None]:
# Another approach would simply replace the `Duration` column in place with its integer value
# Transforming `Duration` into int 

# >>> Remove the comment from the following line of code to try it
# df_movies['Duration'] = df_movies['Duration'].str.replace(' min','').astype(int)


In [None]:
# Review the data types of each column again
df_movies.dtypes

###3. Missing Values    

The third step is to find the number of missing values across the columns of the dataframe. It’s important to understand the count of nulls to determine how best to treat them.    

In [None]:
# find nulls 
df_movies.isnull().sum()
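
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a hypothetical `sample` dataframe; the column names mirror the ones used in this notebook, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical values with a few gaps
sample = pd.DataFrame({
    "Metascore": [80.0, np.nan, 74.0, np.nan],
    "Certificate": ["R", "PG", None, "R"],
})

# Count of nulls and share of rows that are null, per column
null_summary = pd.DataFrame({
    "nulls": sample.isnull().sum(),
    "share": sample.isnull().mean(),
}).sort_values("share", ascending=False)

print(null_summary)
```

A column that is mostly null may be better dropped than imputed, which is the kind of decision this summary supports.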

###4. Missing values treatment    

Using the count of missing values and any other descriptive statistics applicable, the next step is to treat the columns with missing values.    

For illustration purposes, nulls in the numeric column are filled with the column mean and nulls in the categorical column are filled with the column mode, although there are more sophisticated methods of missing value treatment.   


In [None]:
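# Replacing nulls with mean for numeric values and mode for categorical values
# fillna returns a new Series, so the result is assigned back to the column

df_movies['Metascore'] = df_movies['Metascore'].fillna(df_movies['Metascore'].mean())

# mode() returns a Series (there can be ties), so take its first value
df_movies['Certificate'] = df_movies['Certificate'].fillna(df_movies['Certificate'].mode()[0])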


In [None]:
df_movies.head()
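
As one example of a slightly more refined treatment, missing values can be filled with the median of a related group rather than the overall mean. The sketch below uses a hypothetical `sample` dataframe and assumes that `Genre` is a sensible grouping for `Metascore`; that choice is an illustration, not a recommendation from the source article.

```python
import numpy as np
import pandas as pd

# Hypothetical values with gaps in Metascore
sample = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy"],
    "Metascore": [80.0, np.nan, 90.0, 60.0, np.nan],
})

# Median of each genre, aligned row by row with the original frame
group_median = sample.groupby("Genre")["Metascore"].transform("median")

# Fill each gap with the median of its own genre
sample["Metascore"] = sample["Metascore"].fillna(group_median)

print(sample)
```

A group with no observed values would still produce nulls, so a fallback such as the overall median may be needed in practice.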


###5. Outliers    

The fifth step is to check for outliers. There are multiple ways of checking for outliers; the graphical method is presented here. Two continuous variables (`Metascore` and `Runtime`) have been selected to be checked for outliers by evaluating a histogram for each column.    

In [None]:
# Distribution of Metascores 
plt.hist(df_movies['Metascore'],bins = 25)
plt.show()

In [None]:
# Distribution of Runtimes 
plt.hist(df_movies['Runtime'],bins = 25)
plt.show()
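
A numeric check can complement the histograms. The sketch below applies the common 1.5 x IQR rule to a hypothetical series of runtimes; the values and the 1.5 multiplier are assumptions for illustration and are not taken from the IMDB file.

```python
import pandas as pd

# Hypothetical runtimes in minutes, with one unusually long film
runtimes = pd.Series(
    [95, 102, 110, 118, 121, 125, 130, 138, 142, 321], name="Runtime"
)

# Interquartile range and the conventional 1.5 x IQR fences
q1, q3 = runtimes.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values falling outside the fences are flagged for review
outliers = runtimes[(runtimes < lower) | (runtimes > upper)]
print(lower, upper)
print(outliers)
```

The rule only flags candidates; whether a flagged value is an error or a genuine extreme remains a judgment call.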

### 6. Outlier Treatment


The next step is to treat the outliers observed in the previous step. There are different ways of treating the outliers such as:     
1. Capping the minimum and maximum value limits
2. Removing the rows with outlier values   


Although there is nothing off with the distribution of Metascores, for illustration purposes, the minimum `Metascore` value is capped at 65.    


And with respect to the Runtimes, any rows with a `Runtime` in excess of 200 are deleted, as are rows with a `Runtime` of less than 60.   



In [None]:
# Capping the minimum Metascore to 65
df_movies.loc[df_movies['Metascore'] < 65,'Metascore'] = 65

#check the minimum score 
df_movies['Metascore'].min()
# output : 65.0

In [53]:
# Dropping any rows where Runtime exceeds 200 or is below 60
df_movies.drop(df_movies[(df_movies.Runtime > 200) | (df_movies.Runtime < 60)].index, inplace=True)  


In [None]:
# distribution of Runtimes 
plt.hist(df_movies['Runtime'],bins = 25)
plt.show()
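
Capping both the minimum and the maximum can be done in one call with `clip`. The sketch below derives the caps from the 5th and 95th percentiles of a hypothetical series; those percentiles and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical scores with one low and one high extreme
scores = pd.Series(
    [30.0, 66.0, 70.0, 75.0, 80.0, 85.0, 90.0, 100.0], name="Metascore"
)

# Caps taken from the data itself rather than fixed by hand
low, high = scores.quantile(0.05), scores.quantile(0.95)

# Values below `low` are raised to it; values above `high` are lowered to it
capped = scores.clip(lower=low, upper=high)

print(capped)
```

Unlike dropping rows, capping keeps every observation, which matters when the other columns of those rows are still useful.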

###7. Who    


The seventh step is to answer questions related to people, members, stakeholders, etc. 

In films, there are actors, directors, and cast members.  Some 'who' questions to raise could include:  

* Who has directed the most top IMDB movies? (univariate)    


* Who has acted in the most top IMDB movies? (univariate)    


* Which Actor-Director combination has the most top IMDB movies? (bivariate)    


* Who provided the most music in top IMDB movies? (Data not available)    


And More …    



*to use the code in the following snippet, extraction of data values from the `Cast` column would be required as a preliminary transformation.*


In [None]:
### >>> The sample code here relies on columns in the original dataset used by Anmol Tomar
### >>> The columns required by the code in this snippet are not available in the Kaggle dataset

## Who has directed the most number of top IMDB movies ?
# df_movies.groupby(['Director']).agg({'Series_Title':'count'}).reset_index().rename(columns = {'Series_Title':'count'}).\
# sort_values('count',ascending = False).head(5)

## Who has acted in the most number of top movies 
# df_movies.groupby(['Star1']).agg({'Series_Title':'count'}).reset_index().rename(columns = {'Series_Title':'count'}).\
# sort_values('count',ascending = False).head(5)

## Director - Actor works best 
# df_movies.groupby(['Director','Star1'])['Series_Title'].count().reset_index().\
# rename(columns = {'Series_Title':'Count'}).sort_values('Count',ascending = False).head(5)    
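
The preliminary transformation of the `Cast` column might look like the sketch below. It assumes each `Cast` value follows a `Director: ... | Stars: ..., ...` layout; that layout, the names, and the titles are hypothetical and should be checked against the actual file before the pattern is reused.

```python
import pandas as pd

# Hypothetical rows; the "Director: ... | Stars: ..." layout is an assumption
sample = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Cast": [
        "Director: Jane Doe | Stars: Ann Lee, Bob Ray",
        "Director: Jane Doe | Stars: Bob Ray, Cy Poe",
        "Director: Sam Roe | Stars: Ann Lee, Bob Ray",
    ],
})

# Named groups become the column names of the extracted frame
parts = sample["Cast"].str.extract(
    r"Director:\s*(?P<Director>.+?)\s*\|\s*Stars:\s*(?P<Stars>.+)"
)
sample["Director"] = parts["Director"]
# Each row gets a list of actor names
sample["Stars"] = parts["Stars"].str.split(", ")

# Films per director, and films per actor (one row per actor after explode)
top_directors = sample["Director"].value_counts()
top_actors = sample.explode("Stars")["Stars"].value_counts()

print(top_directors)
print(top_actors)
```

Exploding the list of actors counts every billed actor, whereas the commented code above considers only a single `Star1` column.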


###8. When    


The eighth step is to answer questions related to the time dimension (year, quarter, month, week, day, time-of-day, hour, minute, etc.)    

Considering film data, the following type of question could be asked:    


* Which years have the most movies in the IMDB top 1000? (univariate)    




*to use the code in the following snippet, extraction of data values from the `Title` column would be required as a preliminary transformation.*

In [None]:
### >>> The sample code here relies on columns in the original dataset used by Anmol Tomar
### >>> The columns required by the code in this snippet are not available in the Kaggle dataset

## Finding years with most movies in top 1000
# year_dis = df_movies.groupby('Released_Year')['Series_Title'].count().reset_index().\
# rename(columns = {'Series_Title':'Count'}).sort_values('Count',ascending = False).head(10)

# plt.bar(year_dis['Released_Year'].astype(str), year_dis['Count'], width = 0.5)
# plt.xlabel('Years')
# plt.ylabel('Number of Movies')
# plt.title('Years with most movies in IMDB top 1000')
# plt.show()
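
A release year could be pulled out of `Title` along the lines of the sketch below. It assumes each title ends with a four-digit year in parentheses; that layout and the sample titles are hypothetical and should be confirmed against the file.

```python
import pandas as pd

# Hypothetical titles; the trailing "(YYYY)" layout is an assumption
sample = pd.DataFrame({
    "Title": ["1. Film A (1994)", "2. Film B (1972)", "3. Film C (1994)"]
})

# Capture the four digits inside the final pair of parentheses
sample["Released_Year"] = (
    sample["Title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
)

# Number of films per year, most frequent first
year_counts = sample["Released_Year"].value_counts()

print(year_counts)
```

With a `Released_Year` column in place, the commented plotting code above could be adapted by substituting `Title` for `Series_Title`.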

###9. Where    

The ninth step is to look at things from the “place” perspective, for example, country, state, region, etc.  Geolocation can also be used in determining the 'where' attribute of data. 

For film data, the following question could be posed:    


* Which countries have the most movies in the IMDB top 1000?


The dataset does not have the data to answer this question.   
Research should be as exhaustive as possible and not limited based on data availability.  Additional sources or enrichment approaches should be considered to answer pertinent questions.    
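
One enrichment approach is to join the dataset to a second source on a shared key. The sketch below is hypothetical throughout: the `countries` lookup table, its values, and the choice of `Title` as the join key are assumptions made only to show the mechanics.

```python
import pandas as pd

# Hypothetical film list and a hypothetical country lookup from another source
movies = pd.DataFrame({"Title": ["Film A", "Film B", "Film C"]})
countries = pd.DataFrame({
    "Title": ["Film A", "Film B"],
    "Country": ["USA", "Japan"],
})

# A left join keeps every film; indicator=True records which rows found a match
enriched = movies.merge(countries, on="Title", how="left", indicator=True)

# Films the lookup could not place
unmatched = enriched[enriched["_merge"] == "left_only"]

# Films per country among the matched rows
country_counts = enriched["Country"].value_counts()

print(unmatched)
print(country_counts)
```

Checking the unmatched rows matters because titles rarely match exactly across sources; a more stable identifier is preferable when one exists.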



###10. What/Which/How    


The tenth and final step is formulating questions about aspects not covered in the first nine steps. These questions are not related to people, place, or time but to everything apart from these. Formulating such questions can be quite subjective and takes some time and experience to develop intuition and facility.

With respect to film data, such questions might include:    

* Which genres are featured most in the top 1000?


* What is the duration of the top movies?


* What is the correlation between the rating and gross earnings?


and more…

For illustration purposes, the first question is approached using the following code:



In [None]:
## Which genres of film are featured most in top 1000 ? 
genre_dis = df_movies.groupby('Genre')['Title'].count().reset_index().\
rename(columns = {'Title':'Count'}).sort_values('Count',ascending = False).head(5)
fig, ax = plt.subplots()
plt.bar(genre_dis['Genre'], genre_dis['Count'], width = 0.5)
plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()
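
Grouping on `Genre` as above counts each distinct combination of genres as its own category. If a film's genres are stored as a comma-separated string, which is an assumption about the file, individual genres could be counted as in the sketch below. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the comma-separated Genre layout is an assumption
sample = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Genre": ["Crime, Drama", "Drama", "Action, Crime, Drama"],
})

# Split each string into a list, then give every genre its own row
genre_counts = (
    sample.assign(Genre=sample["Genre"].str.split(", "))
    .explode("Genre")["Genre"]
    .value_counts()
)

print(genre_counts)
```

The two views answer different questions: combinations show which blends are common, while exploded counts show which single genres appear most often.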



---



We can refine the exercises in this appendix by including the transformations needed to wrangle the data required for additional analysis.   




---

