Q1. Use the ‘chickwts’ dataset from pydataset for this question.

1a. It’s important to understand the data you analyze. One way to find the description of a dataset
from pydataset is to use the following statement:
    data('chickwts', show_doc=True)
This points to the dataset description for R, and the document also includes some R sample code.
Don’t worry if you don’t understand the R code. In your code, use a comment to describe this dataset, especially its variables.

1b. Does each feed type have the same number of chickens? Use appropriate statements to find helpful statistics. Put your answer in your code as a comment or markdown.

1c. Which feed corresponds to the heaviest chickens? Is it more appropriate to use mean or max to
answer this question?
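
The difference between group sizes, group means, and group maxima in 1b and 1c can be seen on a small example. The sketch below is only an illustration: the `toy` DataFrame and its values are hypothetical stand-ins, not the real chickwts measurements, and the feed labels "a" and "b" are invented.

```python
import pandas as pd

# Hypothetical toy data in the same shape as chickwts (values are invented).
toy = pd.DataFrame({
    "weight": [200, 210, 220, 230, 100, 110, 300],
    "feed": ["a", "a", "a", "a", "b", "b", "b"],
})

# One row per feed: group size plus three summaries of weight.
summary = toy.groupby("feed")["weight"].agg(["count", "mean", "median", "max"])
print(summary)

# The feed ranked first by mean need not be the one ranked first by max:
# feed "b" contains a single very heavy chick but is lighter on average.
heaviest_by_mean = summary["mean"].idxmax()
heaviest_by_max = summary["max"].idxmax()
print(f"By mean: {heaviest_by_mean}, by max: {heaviest_by_max}")
```

In this toy data the two rankings disagree, which is why the choice of statistic matters when one unusually heavy chick is present.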

In [None]:
from pydataset import data
import pandas as pd
import numpy as np


# Q1A
# Load the chickwts dataset + Docs
chickwts = data('chickwts')
data('chickwts', show_doc=True)

'''
Question 1A: The chickwts dataset contains the weights of chicks fed different feed supplements. Newly hatched chicks
were randomly split into six groups, one per feed, and their weights were recorded six weeks later.

The dataset has 71 observations on two variables:
1. weight: The weight of the chick in grams after six weeks (numeric)
2. feed: The type of feed supplement given to the chick (categorical, six types)
'''

# Q1B
feed = chickwts['feed'].value_counts()
print(feed)

'''
Question 1B: No, the feed types do not all have the same number of chickens; group sizes range from 10 to 14.
soybean      14
linseed      12
casein       12
sunflower    12
meatmeal     11
horsebean    10

'''

# Q1C

# Mean
mean_weights = chickwts.groupby('feed')['weight'].mean()
print(mean_weights)
heaviest_feed = mean_weights.idxmax()
print(f"Heaviest chickens eat (mean): {heaviest_feed}")

# Max
max_weights = chickwts.groupby('feed')['weight'].max()
print(max_weights)
heaviest_feed_max = max_weights.idxmax()
print(f"Heaviest chickens eat (max): {heaviest_feed_max}")

''' 
Question 1C: The feed that corresponds to the heaviest chickens is 'sunflower'; it has both the highest mean weight and the highest max weight.
The mean is the more appropriate statistic because it summarizes every chicken on a feed, while the max reflects only the single heaviest chicken
per feed, which could be an outlier rather than the trend for that feed.

Mean Weights:
casein       323.583333
horsebean    160.200000
linseed      218.750000
meatmeal     276.909091
soybean      246.428571
sunflower    328.916667

Max Weights:
feed
casein       404
horsebean    227
linseed      309
meatmeal     380
soybean      329
sunflower    423
'''


chickwts

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Chicken Weights by Feed Type

### Description

An experiment was conducted to measure and compare the effectiveness of
various feed supplements on the growth rate of chickens.

### Usage

    chickwts

### Format

A data frame with 71 observations on 2 variables.

weight

a numeric variable giving the chick weight.

feed

a factor giving the feed type.

### Details

Newly hatched chicks were randomly allocated into six groups, and each group
was given a different feed supplement. Their weights in grams after six weeks
are given along with feed types.

### Source

Anonymous (1948) _Biometrika_, **35**, 214.

### References

McNeil, D. R. (1977) _Interactive Data Analysis_. New York: Wiley.

### Examples

    require(stats); require(graphics)
    boxplot(weight ~ feed, data = chickwts, col = "lightgray",
        varwidth = TRUE, notch = TRUE, main = "chickwt data",
        ylab = "Weight


Q2. This question works with pandas date/time variables and basic analysis. Use the “economics” dataset from the
pydataset package. You need to make sure you understand the variables in order to answer the
questions below.

2a. Use df.dtypes to check the datatype of the columns. What’s the data type for ‘date’?

2b. Create a new column ‘year’ to have the year information from the ‘date’.

2c. Calculate an annual unemployment rate for each year. Is it better to use average annual
unemployed or total annual unemployed in this calculation? Provide a reasoning for your
calculation.

2d. Which year has the highest unemployment rate?
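
The steps in 2a through 2d can be sketched on a small example. This is a minimal illustration under stated assumptions: the `toy` DataFrame and its values are hypothetical, chosen only to mirror the `date`, `pop`, and `unemploy` columns, and the rate is taken relative to total population because no labor-force figure is assumed to be available.

```python
import pandas as pd

# Hypothetical monthly data shaped like the economics dataset (values invented).
toy = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2001-01-01", "2001-02-01"],
    "pop": [1000, 1000, 1000, 1000],
    "unemploy": [50, 70, 40, 40],
})

# Dates read as text are parsed into datetimes, then the year is extracted.
toy["date"] = pd.to_datetime(toy["date"])
toy["year"] = toy["date"].dt.year

# Aggregate per year, keeping both the mean and the sum of unemploy for comparison.
annual = toy.groupby("year").agg(
    unemploy_mean=("unemploy", "mean"),
    unemploy_sum=("unemploy", "sum"),
    pop_mean=("pop", "mean"),
)

# The mean keeps numerator and denominator on the same monthly scale;
# the sum grows with the number of months in the year.
annual["rate_from_mean"] = annual["unemploy_mean"] / annual["pop_mean"] * 100
annual["rate_from_sum"] = annual["unemploy_sum"] / annual["pop_mean"] * 100
print(annual)

worst_year = annual["rate_from_mean"].idxmax()
print(f"Highest rate: {worst_year}")
```

With two months per year in the toy data, the sum-based rate is exactly twice the mean-based one; with twelve months the inflation would be twelvefold.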


In [None]:
from pydataset import data
import pandas as pd
import numpy as np

# Load the dataset + docs
economics = data('economics')
data('economics', show_doc=True)

''' 
The economics dataset is a monthly US economic time series with these variables:
1. date: Month of data collection
2. pce: Personal consumption expenditures, in billions of dollars
3. pop: Total population, in thousands
4. psavert: Personal savings rate
5. uempmed: Median duration of unemployment, in weeks
6. unemploy: Number of unemployed in thousands

'''

# Q2A
print(economics.dtypes)


''' 
Question 2A: The data type of 'date' is object (the dates are stored as strings); the other columns are numeric (float64 or int64).
date         object
pce         float64
pop           int64
psavert     float64
uempmed     float64
unemploy      int64

'''
# Q2B

# Convert date object to datetime
economics['date'] = pd.to_datetime(economics['date'])

# Extract the year from the converted datetime column
economics['year'] = economics['date'].dt.year

print(economics.dtypes)

''' 
Question 2B: Converted the 'date' column from object to datetime and extracted the year into a new 'year' column. The data types are now:
date        datetime64[ns]
pce                float64
pop                  int64
psavert            float64
uempmed            float64
unemploy             int64
year                 int32
'''

# Q2C

# Compute the annual unemployment rate from yearly averages of the monthly figures
annual_stats = economics.groupby('year').agg({
    'unemploy': 'mean',  # average monthly number of unemployed (thousands)
    'pop': 'mean'        # average monthly population (thousands)
}).reset_index()

# Unemployed as a percentage of total population
annual_stats['unemployment_rate'] = (annual_stats['unemploy'] / annual_stats['pop']) * 100

print(annual_stats[['year', 'unemployment_rate']])

'''
Question 2C: Use the average annual unemployed rather than the total. 'unemploy' is a monthly count of people unemployed at that time,
so summing the months counts many of the same people repeatedly and, divided by the population, inflates the rate by roughly the number of months.
A total would also be distorted for any year that does not have all 12 months. Averaging the monthly values of both 'unemploy' and 'pop' gives the
typical level for the year and keeps the numerator and denominator on the same scale. Note that this rate is relative to total population,
because the dataset has no labor-force column.
'''

# Q2D

print(annual_stats.sort_values('unemployment_rate', ascending=False).head(3))
'''
Question 2D: The year with the highest unemployment rate was 1982, at about 4.69% of the total population.

    year      unemploy            pop  unemployment_rate
15  1982  10893.000000  232308.250000           4.689028
16  1983  10483.250000  234418.416667           4.472025
25  1992   9614.666667  257065.916667           3.740156
'''

economics

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## US economic time series.

### Description

This dataset was produced from US economic time series data available from
http://research.stlouisfed.org/fred2.

### Usage

    data(economics)

### Format

A data frame with 478 rows and 6 variables

### Details

  * date. Month of data collection 

  * psavert, personal savings rate, http://research.stlouisfed.org/fred2/series/PSAVERT/

  * pce, personal consumption expenditures, in billions of dollars, http://research.stlouisfed.org/fred2/series/PCE

  * unemploy, number of unemployed in thousands, http://research.stlouisfed.org/fred2/series/UNEMPLOY

  * uempmed, median duration of unemployment, in week, http://research.stlouisfed.org/fred2/series/UEMPMED

  * pop, total population, in thousands, http://research.stlouisfed.org/fred2/series/POP


date         object
pce         float64
pop           int64
psavert     float64
uempmed     f
