## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 8
---------------------------------------

GOALS:

1. Load data into Python that you find online
2. Understand data types and fix some errors
3. Find your own data to play with

----------------------------------------------------------

This homework has **5 questions** and **1 problem**.

NOTE: Be kind to yourself. Working with data can be hard! Every data set is different. **Seriously** come get help! Come to lab!


In [4]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

## Try reading in some data - csv

Go to the [Cal Fire Website](https://www.fire.ca.gov/incidents) and scroll to the bottom to see the Incident Data. Download the file named **ALL DATA AS CSV**; this should put the data file into your Downloads folder.

Next, move the file **mapdataall.csv** from your Downloads folder into the Day8 folder where you are doing your homework. You can open your Downloads folder and drag the file into the JupyterLab sidebar. Then you can run the command

    DF_raw = pd.read_csv('mapdataall.csv')

to load the data and look at the data frame.

In [6]:
# Your code here
DF_raw = pd.read_csv('mapdataall.csv')
show(DF_raw)

FileNotFoundError: [Errno 2] No such file or directory: 'mapdataall.csv'
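
A `FileNotFoundError` like the one above usually means the CSV is not in the folder the notebook is running from. Here is a minimal sketch for checking that before loading. It assumes the file name `mapdataall.csv` from the instructions; the helper name `load_csv_if_present` is made up for illustration and is not part of Pandas.

```python
from pathlib import Path

import pandas as pd


def load_csv_if_present(filename):
    """Return a DataFrame if the file exists; otherwise print a hint and return None."""
    path = Path(filename)
    if not path.exists():
        # Show where Python is looking so you know where to move the file.
        print(f"Could not find '{filename}' in {Path.cwd()}")
        return None
    return pd.read_csv(path)


# Hypothetical usage: DF_check is None until the file is in the notebook's folder.
DF_check = load_csv_if_present('mapdataall.csv')
```

If the printed folder is not your Day8 folder, move the file there (or move the notebook) and run the cell again.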

In [None]:
DF_raw.shape

In [None]:
show(DF_raw['incident_type'].value_counts())

In [None]:
fig = px.histogram(DF_raw,
                 x='incident_acres_burned',
                 nbins=10,
                 color = 'calfire_incident')

fig.update_layout(bargap=0.05,
                  title='Acres burned per fire',
                  title_x=0.3,
                  yaxis_title="Number of fires",
                  xaxis_title="Acres burned",
                  autosize=False,
                  width=700,
                  height=600)

fig.show()

In [None]:
mask = DF_raw['incident_acres_burned'] > 500000
DF_raw[mask]

**Q1** How many variables and observations?

The data has 2,752 observations and 23 variables.

**Q2** How many different incident types are there?

There are three incident types: Wildfire, Fire, and Flood.

**Q3** Make a histogram of the acres burned and color the bars by whether or not the incident was a Cal Fire incident. You will probably need to make a mask to remove very small and very large fires. How many fires burned more than 100,000 acres? What is the largest fire in the data?

There are 20 fires that burned more than 100,000 acres. The largest fire in the data is the August Complex (includes Doe Fire), which burned 1,032,648 acres.
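
One way to build the kind of mask Q3 asks for is to combine two conditions with `&`. This is only a sketch on made-up toy data, not the real Cal Fire table: the column `incident_acres_burned` comes from the homework, while `incident_name` and both cutoff values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data standing in for the Cal Fire table (all values are made up).
toy = pd.DataFrame({
    'incident_name': ['A', 'B', 'C', 'D'],
    'incident_acres_burned': [5.0, 1200.0, 150000.0, 1000000.0],
})

# Keep fires between 10 and 500,000 acres (both cutoffs are arbitrary choices).
mask = (toy['incident_acres_burned'] > 10) & (toy['incident_acres_burned'] < 500000)
mid_sized = toy[mask]

# Count fires over 100,000 acres and find the row with the largest fire.
n_big = (toy['incident_acres_burned'] > 100000).sum()
largest = toy.loc[toy['incident_acres_burned'].idxmax()]

print(len(mid_sized), n_big, largest['incident_name'])
```

Each condition needs its own parentheses, because `&` is evaluated before `>` and `<` in Python.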

**Extra Q** (Challenge) See if you can create a graph that answers the question: Are fires getting bigger or more frequent over time? You get complete creative control on how to answer this question!
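
One possible starting point for the challenge is to count fires and add up acres for each year. The sketch below uses made-up toy data, and the date column name `incident_date_created` is an assumption; check the real column names in your DataFrame before adapting it.

```python
import pandas as pd

# Toy data; 'incident_date_created' is an assumed column name and the values are made up.
toy = pd.DataFrame({
    'incident_date_created': ['2019-06-01', '2019-08-15', '2020-07-04',
                              '2020-09-09', '2020-10-01'],
    'incident_acres_burned': [100.0, 300.0, 50.0, 2000.0, 950.0],
})

# Convert the text dates to datetimes and pull out the year.
toy['year'] = pd.to_datetime(toy['incident_date_created']).dt.year

# One row per year: how many fires, and how many acres in total.
per_year = (toy.groupby('year')['incident_acres_burned']
               .agg(['count', 'sum'])
               .reset_index())
print(per_year)
```

A table like `per_year` can then be passed to a plot, for example `px.bar(per_year, x='year', y='count')`, to see whether the number of fires changes over time.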

In [None]:
# You will need to write some code to answer the questions.





In [3]:
# Extra Code


## Try reading in some data from Wikipedia - html

Here we will explore academy award winning films. Go to the [Wiki for the List of Academy Award Winning Films](https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films). Look at what type of data is there. How many tables? Any weird looking data?

Now read the HTML data into Python and show the first table, `DF[0]`.

In [None]:
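# Read every table on the page into a list of DataFrames, then take the first one
DF = pd.read_html("https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films")
DF_raw = DF[0]
show(DF_raw)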

In [None]:
DF_raw.shape

In [10]:
# Here is some helper code
# This is code that will read in the data and then fix the Year column
my_website = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"
DF = pd.read_html(my_website)
DF_raw = DF[0]
DF_raw['Year'] = DF_raw['Year'].apply(lambda x: int(x.split('/')[0]))
DF_raw['Year'].value_counts().reset_index().rename(columns={"index": "value", 0: "count"})

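|     | Year | count |
|-----|------|-------|
| 0   | 1945 | 21    |
| 1   | 1949 | 20    |
| 2   | 1942 | 20    |
| 3   | 1950 | 19    |
| 4   | 1948 | 19    |
| ... | ...  | ...   |
| 91  | 1931 | 10    |
| 92  | 1932 | 9     |
| 93  | 1928 | 7     |
| 94  | 1929 | 6     |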
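
To see what the `lambda` in the helper code does to each entry, it can help to run the same steps on a few plain strings. The sample strings below are chosen for illustration and are not read from the Wikipedia table.

```python
# Sample strings shaped like entries in the Year column (chosen for illustration).
samples = ['1927/28', '1932/33', '1945']

# split('/') breaks the text at the slash, [0] keeps the first piece,
# and int() turns that piece into a whole number.
years = [int(s.split('/')[0]) for s in samples]
print(years)
```

An entry with no slash, like `'1945'`, passes through unchanged because `split('/')` returns a list containing just that one string.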


**Q4** Following along with the lecture notes or video, fix the data in the 'Awards' column.

In [12]:
# Your code here
award_data = DF_raw['Awards'].value_counts()
DF_award = award_data.reset_index().rename(columns={"index": "value", 0: "count"})
DF_award

In [11]:
# Your code here
DF_raw['Awards'] = DF_raw['Awards'].apply(lambda x: int(x.split('(')[0]))

**Q5** Now try to fix the data in the "Nominations" column - see if you can do it without looking at the answer.

In [None]:
nom_data = DF_raw['Nominations'].value_counts()
DF_nom = nom_data.reset_index().rename(columns={"index": "value", 0: "count"})
show(DF_nom)
print('I can see that I want the data to the left of the [ character')

DF_raw['Nominations'] = DF_raw['Nominations'].apply(lambda x: int(x.split('[')[0]))
# Check that the Nominations values are now clean integers
nom_data = DF_raw['Nominations'].value_counts()
DF_nom = nom_data.reset_index().rename(columns={"index": "value", 0: "count"})
show(DF_nom)

DF_raw.dtypes
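
As an alternative to splitting on one specific character, Pandas string methods can pull the leading digits out of every entry at once. This is a different technique from the `split` approach used above, shown here on a made-up toy column rather than the real table.

```python
import pandas as pd

# Toy column mixing clean numbers with bracketed and parenthetical extras (made up).
toy = pd.DataFrame({'Nominations': ['11', '7[a]', '3 (1)', '0']})

# Pull out the first run of digits in each entry, then convert to integers.
toy['Nominations'] = toy['Nominations'].str.extract(r'(\d+)', expand=False).astype(int)

print(toy['Nominations'].tolist())
print(toy.dtypes)
```

Checking `dtypes` afterward confirms the column is now stored as integers rather than text.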

## Problem 1

Your homework today will be to see if you can find some data of your own. These can be the first steps you take toward your final project.

You should:

* Find some data online
* Read that data into Python using the Pandas commands we learned
* Look at the DataFrame - number of variables, number of observations, AND the dtypes. Comment on what you see.
* Try to do summary statistics (.describe()). Does it work as expected?
* Attempt to fix any data, or explain why the data does not need to be fixed.
* Make some sort of graph using columns in your data.
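
The steps in the list above can be sketched on a small toy DataFrame. All names and values here are made up for illustration; your own data will have different columns and different problems to fix.

```python
import pandas as pd

# Toy data: 'price' is stored as text, which is a common problem in real files.
toy = pd.DataFrame({
    'city': ['A', 'B', 'C'],
    'population': [1000, 2500, 400],
    'price': ['$10', '$25', 'unknown'],
})

print(toy.shape)       # (observations, variables)
print(toy.dtypes)      # 'price' shows up as text, not a number
print(toy.describe())  # only numeric columns are summarized by default

# Strip the dollar sign and convert; entries that are not numbers become NaN.
toy['price'] = pd.to_numeric(toy['price'].str.replace('$', '', regex=False),
                             errors='coerce')
print(toy.describe())  # 'price' is now included in the summary
```

If `.describe()` leaves out a column you expected to see, that is often a sign the column is stored as text and needs fixing.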

In [None]:
DF_raw2 = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release.csv')
show(DF_raw2)

In [None]:
DF_raw2.shape

In [None]:
columns_list = list(DF_raw2.keys())
print(columns_list)

In [None]:
my_columns = ['LowConfidenceLimit', 'HighConfidenceLimit']
DF_raw2[my_columns].describe()

This data set contains indicators of chronic disease. There are 1,185,676 observations and 34 variables. Most of the stratification columns seem to be NaN. I used `.describe()` to summarize the `LowConfidenceLimit` and `HighConfidenceLimit` columns, which appear to be the lower and upper bounds of the confidence interval around each reported value. The 'StratificationCategoryID2', 'StratificationID2', 'StratificationCategoryID3', 'StratificationID3', and 'TopicID' columns need to be fixed because all of them are filled with NaN. I believe I need more practice and help with these concepts.
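
One way to confirm which columns are completely empty, rather than judging by eye, is to ask Pandas directly. The sketch below uses a made-up toy DataFrame; the column names only imitate the style of the real file.

```python
import numpy as np
import pandas as pd

# Toy data: 'StratificationID2' is entirely missing, 'DataValue' is only partly missing.
toy = pd.DataFrame({
    'Topic': ['Asthma', 'Diabetes', 'Asthma'],
    'DataValue': [9.1, np.nan, 8.4],
    'StratificationID2': [np.nan, np.nan, np.nan],
})

# List the columns where every entry is NaN.
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop only the columns that are completely empty.
cleaned = toy.dropna(axis=1, how='all')
print(cleaned.columns.tolist())
```

Using `how='all'` keeps columns like `DataValue` that have some missing entries but still hold useful data.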