Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [None]:
#Imports

import pandas as pd
import numpy as np
import plotly.express as px

In [None]:
#Import our data

#URL to our data on my github repo for build week
url = 'https://raw.githubusercontent.com/JeremySpradlin/DS-Unit-2-Build-Week/master/sunspot_data.csv'


df = pd.read_csv(url)
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Month,Day,Date In Fraction Of Year,Number of Sunspots,Standard Deviation,Observations,Indicator
0,0,1818,1,1,1818.001,-1,-1.0,0,1
1,1,1818,1,2,1818.004,-1,-1.0,0,1
2,2,1818,1,3,1818.007,-1,-1.0,0,1
3,3,1818,1,4,1818.01,-1,-1.0,0,1
4,4,1818,1,5,1818.012,-1,-1.0,0,1


### Choose your Target.  

Which column in your tabular dataset will you predict?
- My target column is `Number of Sunspots`: the number of sunspots observed on a particular date.

### Is your problem regression or classification?
- My problem is a regression problem.

### How is my target distributed?
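- Plotted over time, my target shows a clear cyclical rise and fall in the number of sunspots. Note that the summary statistics from `df.describe()` later in this notebook (a mean of about 79 against a median of 58, with a maximum of 528) point to a right-skewed distribution, so this is worth checking before ruling out a log transform.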
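
One way to check the shape of the target is to compare its mean and median and to compute the sample skewness. The sketch below uses a small made-up series rather than the real data (the values are illustrative assumptions). On the real DataFrame, the same calls would apply to `df['Number of Sunspots']` once the `-1` placeholders are set aside.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the target column (made-up values, not the real data).
target = pd.Series([0, 0, 5, 12, 23, 27, 58, 125, 300, 528],
                   name="Number of Sunspots")

mean, median = target.mean(), target.median()
skewness = target.skew()  # sample skewness; > 0 suggests a longer right tail

print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}")

# log1p is defined at 0, unlike a plain log, so zero-sunspot days are kept.
log_target = np.log1p(target)
print(f"skew after log1p={log_target.skew():.2f}")
```

A mean well above the median, together with positive skewness, is the usual sign of a right-skewed target.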

In [None]:
px.scatter(df, x='Date In Fraction Of Year', y='Number of Sunspots', color='Observations')

### Choose your Evaluation Metrics

Since this is a regression problem, I plan on using the following for metrics:
- MAE
- MSE
- R^2
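
As a rough illustration of what these three metrics measure, here is a minimal NumPy sketch. The function name `regression_metrics` and the sample values are assumptions made for this example. In practice, `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics` compute the same quantities.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))            # average absolute error
    mse = np.mean(errors ** 2)               # average squared error
    ss_res = np.sum(errors ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "r2": r2}

# Made-up values, for illustration only.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE is in the same units as the target (sunspots per day), while MSE is in squared units; taking its square root gives RMSE, which is easier to compare with MAE.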


### Choose which observations you will use to train, validate, and test your model.

- Are some observations outliers? Will you exclude them?
    - Some rows are placeholders rather than true outliers: on days with no recorded observations, the value is `-1`. These show up in the plot above as a band of `-1`s along the bottom of the data. They will need to be removed.

- Will you do a random split or a time-based split?
    - Since the goal is to predict the sunspot number for a particular date, I will use a time-based split. Keeping the training, validation, and test sets in chronological order ensures the model is never trained on data from after the period it is evaluated on.
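
A time-based split can be sketched as follows. The helper name `time_based_split`, the toy rows, and the cutoff years are all assumptions for illustration; the actual cutoffs for this project have not been chosen here.

```python
import pandas as pd

def time_based_split(df, time_col, train_end, val_end):
    """Split rows into train/val/test by cutoffs on a time column.

    train: time <  train_end
    val:   train_end <= time < val_end
    test:  time >= val_end
    """
    df = df.sort_values(time_col)
    train = df[df[time_col] < train_end]
    val = df[(df[time_col] >= train_end) & (df[time_col] < val_end)]
    test = df[df[time_col] >= val_end]
    return train, val, test

# Toy frame; the values and cutoff years are arbitrary assumptions.
toy = pd.DataFrame({
    "Year": [1990, 1995, 2000, 2005, 2010, 2015],
    "Number of Sunspots": [150, 20, 170, 40, 25, 70],
})
train, val, test = time_based_split(toy, "Year", train_end=2000, val_end=2010)
print(len(train), len(val), len(test))
```

Every training row precedes every validation row, and every validation row precedes every test row, which is the property a random split would not guarantee.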

### Begin to clean and explore the dataset

Below we will further explore our dataset, working to determine new features we might be able to use, what features might not be worth including, etc.

In [None]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Year,Month,Day,Date In Fraction Of Year,Number of Sunspots,Standard Deviation,Observations,Indicator
count,73718.0,73718.0,73718.0,73718.0,73718.0,73718.0,73718.0,73718.0,73718.0
mean,36858.5,1918.41675,6.518896,15.729347,1918.916406,79.248732,6.924587,4.429678,0.998331
std,21280.697909,58.264401,3.447114,8.800032,58.26452,77.470942,4.778793,7.884112,0.040814
min,0.0,1818.0,1.0,1.0,1818.001,-1.0,-1.0,0.0,0.0
25%,18429.25,1868.0,4.0,8.0,1868.4585,15.0,3.0,1.0,1.0
50%,36858.5,1918.0,7.0,16.0,1918.9175,58.0,6.6,1.0,1.0
75%,55287.75,1969.0,10.0,23.0,1969.37325,125.0,10.0,1.0,1.0
max,73717.0,2019.0,12.0,31.0,2019.832,528.0,77.7,60.0,1.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73718 entries, 0 to 73717
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                73718 non-null  int64  
 1   Year                      73718 non-null  int64  
 2   Month                     73718 non-null  int64  
 3   Day                       73718 non-null  int64  
 4   Date In Fraction Of Year  73718 non-null  float64
 5   Number of Sunspots        73718 non-null  int64  
 6   Standard Deviation        73718 non-null  float64
 7   Observations              73718 non-null  int64  
 8   Indicator                 73718 non-null  int64  
dtypes: float64(2), int64(7)
memory usage: 5.1 MB


### Missing values
A cursory glance at the data shows no NaNs. However, the information provided with the dataset says that `-1` is used in the `Number of Sunspots` target column to mark days with no recorded data.  
- It's important not to replace `-1` with 0; otherwise we would conflate having no data with having data that shows 0 sunspots.  

In [None]:
#Replace -1 with NaN in our Number of Sunspots target column

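#Assign the result back instead of using inplace=True on a column selection,
#and use np.nan (the np.NaN alias was removed in NumPy 2.0)
df['Number of Sunspots'] = df['Number of Sunspots'].replace(-1, np.nan)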
df.head()

Unnamed: 0.1,Unnamed: 0,Year,Month,Day,Date In Fraction Of Year,Number of Sunspots,Standard Deviation,Observations,Indicator
0,0,1818,1,1,1818.001,,-1.0,0,1
1,1,1818,1,2,1818.004,,-1.0,0,1
2,2,1818,1,3,1818.007,,-1.0,0,1
3,3,1818,1,4,1818.01,,-1.0,0,1
4,4,1818,1,5,1818.012,,-1.0,0,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73718 entries, 0 to 73717
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                73718 non-null  int64  
 1   Year                      73718 non-null  int64  
 2   Month                     73718 non-null  int64  
 3   Day                       73718 non-null  int64  
 4   Date In Fraction Of Year  73718 non-null  float64
 5   Number of Sunspots        70471 non-null  float64
 6   Standard Deviation        73718 non-null  float64
 7   Observations              73718 non-null  int64  
 8   Indicator                 73718 non-null  int64  
dtypes: float64(3), int64(6)
memory usage: 5.1 MB


In [None]:
df['Number of Sunspots'].value_counts()

 0      11101
-1       3247
 12      1303
 27       920
 23       918
        ...  
 484        1
 356        1
 463        1
 518        1
 335        1
Name: Number of Sunspots, Length: 438, dtype: int64

Comparing the `info()` output before and after changing the `-1`s to NaN, the non-null count for `Number of Sunspots` falls from 73,718 to 70,471, so we would lose 3,247 observations (about 4.4% of the data) by dropping them. With more than 73,000 observations to begin with, we should still be fine dropping these rows in our wrangle function.
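
The count and share of missing rows can also be computed directly rather than read off `info()`. This is a minimal sketch on a made-up frame (the values are assumptions); the same calls would apply to the real `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with -1 as the "no data" placeholder (made-up values).
toy = pd.DataFrame({"Number of Sunspots": [-1, 0, 12, -1, 27]})
toy["Number of Sunspots"] = toy["Number of Sunspots"].replace(-1, np.nan)

n_missing = toy["Number of Sunspots"].isna().sum()      # rows with no data
share_missing = toy["Number of Sunspots"].isna().mean()  # fraction missing
cleaned = toy.dropna(subset=["Number of Sunspots"])

print(f"missing rows: {n_missing} ({share_missing:.1%}), kept: {len(cleaned)}")
```

Note that the genuine 0 reading survives the drop, which is why the placeholder is converted to NaN rather than to 0.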

In [None]:
px.scatter(df, x='Date In Fraction Of Year', y='Indicator')


In [None]:
df['Indicator'].value_counts()


1    73595
0      123
Name: Indicator, dtype: int64

### `Indicator` Column
According to the dataset's page on Kaggle, Indicator is: `Definitive/provisional indicator. A blank indicates that the value is definitive. A '*' symbol indicates that the value is still provisional and is subject to a possible revision (Usually the last 3 to 6 months)`

Looking at the column in more detail, we see only two values, 1 and 0, with the vast majority being 1 (73,595 rows against 123). Neither value appears in the column description, which mentions only a blank and a `*`. Given the lack of variation in the column, we will remove this column as well.

In [None]:
px.scatter(df, x='Date In Fraction Of Year', y='Observations')

### `Observations` Column

The plot above shows that the number of observations stays at 1 for most of the record and only varies in recent decades. From Kaggle: `Before 1981, the number of observations is set to 1, as the Sunspot Number was then essentially the raw Wolf number from the Zürich Observatory.` 

I want to explore this column in future work, but I will initially remove it because it shows no variation until the last few decades.

### `Standard Deviation` Column

In [None]:
px.scatter(df, x='Date In Fraction Of Year', y='Standard Deviation')

More exploration is needed to understand the predictive ability of this column, if any exists. However, a quick plot shows patterns similar to what we see in our number of sunspots, so for now we will keep this column in our data. 

We can see some very obvious anomalies in the data, especially in the first several decades, as well as some in the first half of the 20th century. Below we will explore this column and determine how best to handle the anomalies. 

### Make Wrangle Function

From our EDA above, we need to perform the following on our data set:
- In `Number of Sunspots`, replace `-1` with NaN
- Drop the `Indicator` column
- Drop the `Observations` column
    - The final `wrangle()` will take a flag that keeps and prepares this column instead of simply removing it
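
The steps above could be combined roughly as follows. This is a first-pass sketch, not the final function: the flag name `keep_observations` is hypothetical, the "preparation" of `Observations` is not yet defined so the flag only keeps the column, and dropping rows with a missing target follows the decision made in the missing-values section.

```python
import numpy as np
import pandas as pd

def wrangle(df, keep_observations=False):
    """Apply the cleaning steps listed above and return a new DataFrame.

    keep_observations is a hypothetical flag name: when True, the
    `Observations` column is kept instead of dropped.
    """
    df = df.copy()  # leave the caller's DataFrame untouched
    target = "Number of Sunspots"
    # -1 marks days with no recorded data, so treat it as missing.
    df[target] = df[target].replace(-1, np.nan)
    # Drop rows whose target is missing.
    df = df.dropna(subset=[target])
    # Drop the low-variation columns identified during EDA.
    to_drop = ["Indicator"]
    if not keep_observations:
        to_drop.append("Observations")
    return df.drop(columns=to_drop)

# Toy rows shaped like the dataset (made-up values).
toy = pd.DataFrame({
    "Year": [1818, 1818, 2019],
    "Number of Sunspots": [-1, 0, 14],
    "Standard Deviation": [-1.0, 3.0, 1.2],
    "Observations": [0, 1, 30],
    "Indicator": [1, 1, 0],
})
clean = wrangle(toy)
print(clean.columns.tolist(), len(clean))
```

Working on a copy keeps the raw data available, so the notebook can call `wrangle()` again with different options without reloading the CSV.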