# The Distribution of First Digits

In this lab, you will explore the distribution of first digits in real data. For example, the first digits of the numbers 52, 30.8, and 0.07 are 5, 3, and 7, respectively. The question you will investigate is: how frequently does each digit 1-9 appear as the first digit of a number?
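
To make the definition concrete, here is a minimal sketch of a helper that returns the first digit of a number. The name `first_digit` is hypothetical (it is not part of the lab materials), and the sketch assumes the input is a nonzero number whose `str()` form contains at least one nonzero digit.

```python
def first_digit(x):
    """Return the first nonzero digit of a nonzero number.

    `first_digit` is a hypothetical helper name used for illustration only.
    """
    # Scan the text form of |x| and return the first character in 1-9,
    # skipping leading zeros and the decimal point (e.g. "0.07" -> 7).
    for ch in str(abs(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("zero has no first digit")


# The three examples from the text above.
print([first_digit(v) for v in (52, 30.8, 0.07)])  # [5, 3, 7]
```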


## Question 0

Make a prediction. 

1. Approximately what percentage of the values do you think will have a _first_ digit of 1? What percentage of the values do you think will have a first digit of 9?
2. Approximately what percentage of the values do you think will have a _last_ digit of 1? What percentage of the values do you think will have a last digit of 9?

(Don't worry about being wrong. You will earn full credit for any justified answer.)

- Percentage of values with a first digit of 1: about 10%.
- Percentage of values with a first digit of 9: about 10%.
- Percentage of values with a last digit of 1: about 10%.
- Percentage of values with a last digit of 9: about 10%.

I predict this because I expect every digit to be about equally likely. A first digit can only be 1-9 (nine options, about 11% each) and a last digit can be 0-9 (ten options, 10% each), so each percentage should be roughly 10%.

## Question 1

The [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index) is a stock index based on the market capitalizations of large companies that are publicly traded on the NYSE or NASDAQ. The CSV file `sp500.csv` contains data from February 1, 2018 about the stocks that comprise the S&P 500. We will investigate the first digit distributions of the variables in this data set.

Read in the S&P 500 data. What is the unit of observation in this data set? Is there a variable that is natural to use as the index? If so, set that variable to be the index. Once you are done, display the `DataFrame`.

In [31]:
# ENTER YOUR CODE HERE.
import pandas as pd
df = pd.read_csv("sp500.csv")
df.head()

Unnamed: 0,date,Name,open,close,volume
0,2018-02-01,AAL,$54.00,$53.88,3623078
1,2018-02-01,AAPL,$167.16,$167.78,47230787
2,2018-02-01,AAP,$116.24,$117.29,760629
3,2018-02-01,ABBV,$112.24,$116.34,9943452
4,2018-02-01,ABC,$97.74,$99.29,2786798


In [32]:
# set_index returns a new DataFrame; df itself keeps its default integer index
df.set_index('Name').head()

Name,date,open,close,volume
AAL,2018-02-01,$54.00,$53.88,3623078
AAPL,2018-02-01,$167.16,$167.78,47230787
AAP,2018-02-01,$116.24,$117.29,760629
ABBV,2018-02-01,$112.24,$116.34,9943452
ABC,2018-02-01,$97.74,$99.29,2786798


In [33]:
df.shape

(505, 5)

**ENTER YOUR WRITTEN EXPLANATION HERE.**

Each row describes one S&P 500 stock on February 1, 2018, so the unit of observation is a stock. `Name` (the ticker symbol) is the natural index because it uniquely identifies each row.
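
If the goal is for `Name` to remain the index, one option is to set it while reading the file. The sketch below uses a few rows copied from the output above in place of `sp500.csv`; the in-memory `sample_csv` string and the name `sample` are assumptions made so that the example runs on its own, and the column layout is inferred from the displayed `DataFrame`.

```python
import io

import pandas as pd

# A few rows copied from the displayed output; stands in for sp500.csv.
sample_csv = io.StringIO(
    "date,Name,open,close,volume\n"
    "2018-02-01,AAL,$54.00,$53.88,3623078\n"
    "2018-02-01,AAPL,$167.16,$167.78,47230787\n"
    "2018-02-01,AAP,$116.24,$117.29,760629\n"
)

# index_col makes Name the index as the file is read.
sample = pd.read_csv(sample_csv, index_col="Name")

# Each row is one stock, so the tickers should be unique.
print(sample.index.is_unique)

# Rows can now be looked up by ticker.
print(sample.loc["AAPL", "volume"])
```

With the real file, the equivalent call would be `pd.read_csv("sp500.csv", index_col="Name")`. The remaining cells in this notebook keep the default integer index.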

## Question 2

We will start by looking at the `volume` column. This variable tells us how many shares were traded on that date.

Extract the first digit of every value in this column. (_Hint:_ First, turn the numbers into strings. Then, use the [text processing functionalities](https://pandas.pydata.org/pandas-docs/stable/text.html) of `pandas` to extract the first character of each string.) Make an appropriate visualization to display the distribution of the first digits. (_Hint:_ Think carefully about whether the variable you are plotting is quantitative or categorical.)

How does this compare with what you predicted in Question 0?

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    505 non-null    object
 1   Name    505 non-null    object
 2   open    505 non-null    object
 3   close   505 non-null    object
 4   volume  505 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 19.9+ KB


In [38]:
# ENTER YOUR CODE HERE.

import matplotlib
%matplotlib inline

# volume is already stored as int64, so this cast only makes the type explicit
df['volume'] = df['volume'].astype('int64')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    505 non-null    object
 1   Name    505 non-null    object
 2   open    505 non-null    object
 3   close   505 non-null    object
 4   volume  505 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 19.9+ KB


In [66]:
# Convert each volume to text, take the first character, and convert it back to an integer
df['first_digit'] = df['volume'].astype(str).str[0].astype(int)
df['first_digit']

0      3
1      4
2      7
3      9
4      2
      ..
500    1
501    1
502    1
503    3
504    2
Name: first_digit, Length: 505, dtype: int64

In [77]:
df['first_digit'].value_counts()

first_digit
1    165
2     93
3     59
4     43
5     41
6     36
7     25
8     22
9     21
Name: count, dtype: int64

In [83]:
prob_of_firsts = df['first_digit'].value_counts() / df['first_digit'].value_counts().sum()
prob_of_firsts 

first_digit
1    0.326733
2    0.184158
3    0.116832
4    0.085149
5    0.081188
6    0.071287
7    0.049505
8    0.043564
9    0.041584
Name: count, dtype: float64

**ENTER YOUR WRITTEN EXPLANATION HERE.**

The digit 1 is by far the most common first digit, appearing in about 33% of the values (165 of 505), and the frequency falls steadily to about 4% for the digit 9.
This is very different from my prediction in Question 0 that each digit would be about equally likely.
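
The question also asks for a visualization. Because the first digit is a categorical label rather than a quantity, a bar chart with one bar per digit is one reasonable choice. The sketch below rebuilds the counts from the `value_counts()` output above so that it runs on its own; in the notebook, `df['first_digit'].value_counts()` would supply them directly. The variable names are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the value_counts() output above (digit -> count).
first_digit_counts = pd.Series(
    [165, 93, 59, 43, 41, 36, 25, 22, 21],
    index=pd.Index(range(1, 10), name="first_digit"),
)

# Convert counts to proportions and keep the digits in order 1-9.
first_digit_props = (first_digit_counts / first_digit_counts.sum()).sort_index()

# One bar per digit: the digit is a category, so use a bar chart, not a histogram.
ax = first_digit_props.plot.bar(rot=0)
ax.set_xlabel("First digit of volume")
ax.set_ylabel("Proportion of stocks")
ax.set_title("First-digit distribution of volume")
```

Sorting by digit rather than by frequency keeps the x-axis in the natural order 1-9, which makes the downward trend easier to read.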

## Question 3

Now, repeat Question 2, but for the distribution of _last_ digits. Again, make an appropriate visualization and compare with your prediction in Question 0.

In [90]:
# ENTER YOUR CODE HERE.
df['last_digit'] = df['volume'].apply(lambda x: int(str(x)[-1]))
df

Unnamed: 0,date,Name,open,close,volume,first_digit,last_digit
0,2018-02-01,AAL,$54.00,$53.88,3623078,3,8
1,2018-02-01,AAPL,$167.16,$167.78,47230787,4,7
2,2018-02-01,AAP,$116.24,$117.29,760629,7,9
3,2018-02-01,ABBV,$112.24,$116.34,9943452,9,2
4,2018-02-01,ABC,$97.74,$99.29,2786798,2,8
...,...,...,...,...,...,...,...
500,2018-02-01,XYL,$72.50,$74.84,1817612,1,2
501,2018-02-01,YUM,$84.24,$83.98,1685275,1,5
502,2018-02-01,ZBH,$126.35,$128.19,1756300,1,0
503,2018-02-01,ZION,$53.79,$54.98,3542047,3,7


In [92]:
last_digit_counts = df['last_digit'].value_counts()
last_digit_total = df['last_digit'].value_counts().sum()
prob_of_lasts = last_digit_counts / last_digit_total
prob_of_lasts

last_digit
8    0.110891
2    0.110891
1    0.104950
9    0.104950
7    0.102970
0    0.102970
6    0.100990
3    0.095050
5    0.087129
4    0.079208
Name: count, dtype: float64

**ENTER YOUR WRITTEN EXPLANATION HERE.**

My prediction was close for the last digit: every digit from 0 to 9 appears in roughly 10% of the values, with observed proportions ranging from about 8% to about 11%.
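
To put a number on how close the prediction was, one simple check is how far each observed proportion is from the 10% expected for a uniform last digit. The sketch below copies the proportions from the output above; the names `observed`, `expected`, `deviation`, and `worst` are illustrative.

```python
# Last-digit proportions copied from the output above (digit -> proportion).
observed = {
    8: 0.110891, 2: 0.110891, 1: 0.104950, 9: 0.104950, 7: 0.102970,
    0: 0.102970, 6: 0.100990, 3: 0.095050, 5: 0.087129, 4: 0.079208,
}

# Under the uniform prediction, each of the 10 digits has probability 1/10.
expected = 1 / len(observed)

# Signed gap between the observed and predicted proportion for each digit.
deviation = {d: p - expected for d, p in observed.items()}

# The digit whose proportion is furthest from 10%.
worst = max(deviation, key=lambda d: abs(deviation[d]))
print(worst, round(deviation[worst], 3))
```

Here the largest gap belongs to the digit 4, which falls about 2 percentage points below 10%. That is small compared with the spread seen for first digits.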

## Question 4

Maybe the `volume` column was just a fluke. Let's see if the first digit distribution holds up when we look at a very different variable: the closing price of the stock. Make a visualization of the first digit distribution of the closing price (the `close` column of the `DataFrame`). Comment on what you see.

(_Hint:_ What type did `pandas` infer this variable as and why? You will have to first clean the values using the [text processing functionalities](https://pandas.pydata.org/pandas-docs/stable/text.html) of `pandas` and then convert this variable to a quantitative variable.)

In [93]:
# ENTER YOUR CODE HERE.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         505 non-null    object
 1   Name         505 non-null    object
 2   open         505 non-null    object
 3   close        505 non-null    object
 4   volume       505 non-null    int64 
 5   first_digit  505 non-null    int64 
 6   last_digit   505 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 27.7+ KB


In [95]:
df['close']

0       $53.88
1      $167.78
2      $117.29
3      $116.34
4       $99.29
        ...   
500     $74.84
501     $83.98
502    $128.19
503     $54.98
504     $77.82
Name: close, Length: 505, dtype: object

In [102]:
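# regex=False treats '$' as a literal character rather than a regular-expression anchor
df['close'] = df['close'].str.replace('$', '', regex=False)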
df['close']

0       53.88
1      167.78
2      117.29
3      116.34
4       99.29
        ...  
500     74.84
501     83.98
502    128.19
503     54.98
504     77.82
Name: close, Length: 505, dtype: object

In [104]:
df['FD_close'] = df['close'].apply(lambda x: int(str(x)[0]))
df['LD_close'] = df['close'].apply(lambda x: int(str(x)[-1]))
df

Unnamed: 0,date,Name,open,close,volume,first_digit,last_digit,FD_close,LD_close
0,2018-02-01,AAL,$54.00,53.88,3623078,3,8,5,8
1,2018-02-01,AAPL,$167.16,167.78,47230787,4,7,1,8
2,2018-02-01,AAP,$116.24,117.29,760629,7,9,1,9
3,2018-02-01,ABBV,$112.24,116.34,9943452,9,2,1,4
4,2018-02-01,ABC,$97.74,99.29,2786798,2,8,9,9
...,...,...,...,...,...,...,...,...,...
500,2018-02-01,XYL,$72.50,74.84,1817612,1,2,7,4
501,2018-02-01,YUM,$84.24,83.98,1685275,1,5,8,8
502,2018-02-01,ZBH,$126.35,128.19,1756300,1,0,1,9
503,2018-02-01,ZION,$53.79,54.98,3542047,3,7,5,8


In [107]:
prob_close_FD = df['FD_close'].value_counts() / df['FD_close'].value_counts().sum()
prob_close_LD = df['LD_close'].value_counts() / df['LD_close'].value_counts().sum()
prob_close_FD, prob_close_LD

(FD_close
 1    0.338614
 2    0.108911
 3    0.102970
 6    0.095050
 7    0.085149
 4    0.085149
 5    0.077228
 8    0.055446
 9    0.051485
 Name: count, dtype: float64,
 LD_close
 9    0.122772
 8    0.112871
 6    0.104950
 2    0.104950
 0    0.104950
 5    0.102970
 7    0.093069
 1    0.089109
 3    0.089109
 4    0.075248
 Name: count, dtype: float64)

**ENTER YOUR WRITTEN EXPLANATION HERE.**

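The first-digit distributions for `close` and `volume` are quite similar. In both, the digit 1 is the most common (about 34% for `close` versus about 33% for `volume`) and 9 is the least common (about 5% versus about 4%), a difference of roughly one percentage point at each extreme. The first-digit proportions are far from equal across digits, although for `close` they do not decrease strictly from 1 to 9 (for example, 6 is more common than 4 or 5). In contrast, the last-digit proportions are fairly similar for every digit, ranging from about 8% to about 12%.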
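
The hint for this question suggests cleaning `close` and converting it to a quantitative variable. A minimal sketch of that step follows, using five closing prices copied from the output above; `close_raw` and the other variable names are illustrative. It assumes every value is a price of at least $1 written with a leading `$`, so the first character after cleaning is the first digit.

```python
import pandas as pd

# Five closing prices copied from the displayed output (stored as text).
close_raw = pd.Series(["$53.88", "$167.78", "$117.29", "$116.34", "$99.29"])

# Strip the literal dollar sign; regex=False keeps "$" from being read
# as the regular-expression end-of-string anchor.
close_text = close_raw.str.replace("$", "", regex=False)

# Convert the cleaned text to a quantitative (float) variable.
close_num = pd.to_numeric(close_text)

# First digit: first character of the cleaned text, as an integer.
close_first_digit = close_text.str[0].astype(int)

print(close_num.dtype)
print(close_first_digit.tolist())
```

The resulting first digits agree with the `FD_close` values shown for these five rows above.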

## Submission Instructions

Once you are finished, follow these steps:

1. Restart the kernel and re-run this notebook from beginning to end by going to `Kernel > Restart Kernel and Run All Cells`.

2. If this process stops halfway through, that means there was an error. Correct the error and repeat Step 1 until the notebook runs from beginning to end.

3. Double check that there is a number next to each code cell and that these numbers are in order.

Then, submit your lab as follows:

1. This quarter, you don't need to demo Lab 1. The first lab to demo will be Lab 2.

2. Upload your `.ipynb` notebook to Canvas and the PDF to Gradescope.