In [2]:
# run this to shorten the data import from the files
import os
path_data = os.path.join(os.path.dirname(os.getcwd()), 'datasets/')


In [3]:
import pandas as pd
import numpy as np
import scipy.stats as stats

investments_df = pd.read_csv(path_data+'investments_VC.csv')
investments_df.head()

Unnamed: 0,market,funding_total_usd,status,country_code,funding_rounds,seed,venture,equity_crowdfunding,private_equity
0,Games,4000000,operating,USA,2,0,4000000,0,0
1,Software,7000000,,USA,1,0,7000000,0,0
2,Advertising,4912393,closed,ARG,1,0,0,0,0
3,Curated Web,2000000,operating,,1,0,2000000,0,0
4,Games,41250,operating,HKG,1,41250,0,0,0


In [6]:
# exercise 01

"""
Effect size for means

Many venture capital-backed companies receive more than one round of funding. In general, the second round is bigger than the first. Just how much of an effect does the round number have on the average funding amount? You can use Cohen's d to quantify this.

Recall that, to calculate Cohen's d, you need to first calculate the pooled standard deviation. That is given by the equation

s = √( ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2) )

Cohen's d is then given by:

(X_1.mean - X_2.mean)/s

A DataFrame of venture capital investments (investments_df) has been loaded for you, as have the packages pandas as pd, NumPy as np and stats from SciPy. The column funding_total_usd shows the total funding received in that round.
"""

# Instructions

"""

    Filter investments_df to select funding_rounds 1 and 2 separately.
    Calculate the standard deviation and sample size of each round.
    Calculate the pooled standard deviation between the two rounds.
    Calculate Cohen's d using the terms you just calculated.

"""

# solution

# Select all investments from rounds 1 and 2 separately
round1_df = investments_df[investments_df['funding_rounds'] == 1]
round2_df = investments_df[investments_df['funding_rounds'] == 2]

# Calculate the standard deviation of each round and the number of companies in each round
round1_sd = round1_df['funding_total_usd'].std()
round2_sd = round2_df['funding_total_usd'].std()
round1_n = round1_df.shape[0]
round2_n = round2_df.shape[0]

# Calculate the pooled standard deviation between the two rounds
pooled_sd = np.sqrt(((round1_n - 1) * round1_sd ** 2 + (round2_n - 1) *round2_sd ** 2) / (round1_n + round2_n - 2))

# Calculate Cohen's d
d = (round1_df['funding_total_usd'].mean() - round2_df['funding_total_usd'].mean()) / pooled_sd

print(d)

#----------------------------------#

# Conclusion

"""
If you print d in the console, you'll see it's only about -0.08. The sign is negative because the round 2 mean is larger than the round 1 mean; the magnitude, about 0.08, is a surprisingly low value! That tells us that moving to a second round of funding does not in itself have a large effect on the amount of money raised. This is likely because the standard deviations are very large compared with the difference between the means, so the two groups overlap heavily.
"""

-0.07719192881235956
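
The pooled standard deviation and Cohen's d from this exercise can be wrapped in a small reusable function. This is a minimal sketch, not part of the original exercise: the function name `cohens_d` and the two toy groups are made up for illustration.

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d for two samples, using the pooled standard deviation."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    # ddof=1 gives the sample standard deviation, matching pandas' .std()
    s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (x1.mean() - x2.mean()) / pooled_sd

# Made-up numbers, for illustration only
group_a = [2.0, 4.0, 6.0, 8.0]
group_b = [4.0, 6.0, 8.0, 10.0]

d_example = cohens_d(group_a, group_b)
print(d_example)  # negative, because group_a has the smaller mean
```

Swapping the two arguments flips the sign but not the magnitude, which is why effect sizes are usually judged by their absolute value.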


In [7]:
btc_sp_df = pd.read_csv(path_data+'btc_sp.csv')
btc_sp_df.head()

Unnamed: 0,Date,Open_BTC,High_BTC,Low_BTC,Close_BTC,Close_SP500,Open_SP500,High_SP500,Low_SP500
0,2017-08-07,3212.780029,3397.679932,3180.889893,3378.939941,2480.91,2477.14,2480.95,2475.88
1,2017-08-08,3370.219971,3484.850098,3345.830078,3419.939941,2474.92,2478.35,2490.87,2470.32
2,2017-08-09,3420.399902,3422.76001,3247.669922,3342.469971,2474.02,2465.35,2474.41,2462.08
3,2017-08-10,3341.840088,3453.449951,3319.469971,3381.280029,2438.21,2465.38,2465.38,2437.75
4,2017-08-11,3373.820068,3679.719971,3372.120117,3650.620117,2441.32,2441.04,2448.09,2437.85


In [8]:
# exercise 02

"""
Effect size for correlations

The volatility of an asset is roughly defined by how much its price changes. In this exercise you'll measure volatility on a per-day basis, defined as the (high price - low price) / closing price.

What factors explain the volatility of Bitcoin? Is the volatility of the S&P500 closely related to this? Does volatility increase or decrease as prices rise? In other words, what is the effect size of the correlation between these different factors? You'll compute both of these effect sizes in this exercise.

A DataFrame of S&P 500 and Bitcoin prices (btc_sp_df) has been loaded for you, as have the packages pandas as pd, NumPy as np, Matplotlib as plt, and stats from SciPy.
"""

# Instructions

"""
Compute the volatility of BTC.
Repeat for the S&P 500.
Compute R² between the volatility of each asset.
Compute R² between the volatility and closing price of BTC.
---
Question

Neither the volatility of the S&P 500 nor the closing price of BTC has a large correlation with the volatility of BTC. However, you still learn something from the correlations! You can see that the volatility of the S&P 500 explains about 3% of the variation in the volatility of BTC, while the closing price of BTC explains about 1%. Therefore, price swings in BTC aren't simply related to price swings in the S&P 500, nor to the price of BTC being especially high or low. These sorts of quantitative assessments are important to conduct.

Next, you'll turn to inference. Which of the following statements are supported by the effect size calculations you just ran?
[The volatility in the S&P 500 has the greater effect on the volatility of Bitcoin.]
"""

# solution

# Compute the volatility of Bitcoin
btc_sp_df['Volatility_BTC'] = (btc_sp_df['High_BTC'] - btc_sp_df['Low_BTC']) / btc_sp_df['Close_BTC']

# Compute the volatility of the S&P500
btc_sp_df['Volatility_SP500'] = (btc_sp_df['High_SP500'] - btc_sp_df['Low_SP500']) / btc_sp_df['Close_SP500']

# Compute and print R^2 between the volatility of BTC and SP500
r_volatility, p_value_volatility = stats.pearsonr(btc_sp_df['Volatility_BTC'], btc_sp_df['Volatility_SP500'])
print('R^2 between volatility of the assets:', r_volatility**2)

# Compute and print R^2 between the volatility of BTC and the closing price of BTC
r_closing, p_value_closing = stats.pearsonr(btc_sp_df['Volatility_BTC'], btc_sp_df['Close_BTC'])
print('R^2 between closing price and volatility of BTC:', r_closing**2)

#----------------------------------#

# Conclusion

"""
While examining things like line graphs and scatter plots can help you gain an intuitive understanding of the data, rigorous statistical tests are a necessity. They take you beyond 'I feel like...' and towards 'The data shows that...'. These tools are key in principled decision making.
"""

R^2 between volatility of the assets: 0.03152987723111431
R^2 between closing price and volatility of BTC: 0.01252065913517721
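
To see why the squared Pearson correlation is read as "share of variation explained", the sketch below compares `r**2` with `1 - SS_res / SS_tot` from a simple least-squares line. The data are synthetic (a weak linear signal plus noise, generated with a fixed seed) and the variable names are illustrative, not from the exercise.

```python
import numpy as np
from scipy import stats

# Synthetic data: y depends weakly on x, plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.2 * x + rng.normal(size=500)

# Effect size as in the exercise: squared Pearson correlation
r, p_value = stats.pearsonr(x, y)
r_squared = r ** 2

# Same quantity from a least-squares fit: 1 - (residual variance / total variance)
fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)
explained = 1 - residuals.var() / y.var()

print(r_squared, explained)  # the two values agree up to floating-point error
```

For a single predictor the two numbers coincide, so an R² of about 0.03 means a straight line through the S&P 500 volatility accounts for roughly 3% of the variance in BTC volatility.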


In [11]:
employees_df = pd.read_csv(path_data+'employees_df.csv')
employees_df.head()

Unnamed: 0,Title,Asian,Black or African American,Hispanic or Latino,White
0,Administrative Specialist,5,34,99,78
1,Fire Specialist,8,9,36,149
2,Firefighter,5,37,127,361
3,"MuniProg, Paraprofessional",37,142,227,493
4,Police Corporal/Detective,5,31,77,263


In [15]:
# exercise 03

"""
Effect size for categorical variables

You saw in the City of Austin employee data that job titles have an unequal distribution of genders. But does the same thing hold for ethnicities? And to what extent does ethnicity relate to the job title chosen? In this exercise you'll dig in and answer that question.

A DataFrame comparing job titles and ethnicities (employees_df) has been loaded for you, as have the packages pandas as pd, NumPy as np, and stats from SciPy.
"""

# Instructions

"""

    Compute the chi-squared statistic from the contingency table employees_df.
    Compute the degrees of freedom for Cramer's V.
    Compute the total number of people in the contingency table.
    Compute Cramer's V using the equation from the video.
---
Question

While a chi-squared test can test for association between two categorical variables, it doesn't directly answer the question of the degree of association. By computing an effect size like Cramer's V, you can directly measure the effect that one variable has on the other.

Now, you'll turn to inference. What can you conclude from the value of Cramer's V shown in the console?
[Ethnicity has a moderate effect on the job title a person holds.]
"""

# solution

# Compute the chi-squared statistic
chi2, p, d, expected = stats.chi2_contingency(employees_df.iloc[:,1:])

# Compute the DOF using the number of rows and columns
dof = min(employees_df.shape[0] - 1, employees_df.shape[1] - 1)

# Compute the total number of people
n = np.sum(employees_df.iloc[:,1:].values)

# Compute Cramer's V
v = np.sqrt((chi2 / n) / dof)

print("Cramer's V:", v, "\nDegrees of freedom:", dof)

#----------------------------------#

# Conclusion

"""
Unlike R-squared, Cramer's V doesn't have an easy interpretation. But by knowing the degrees of freedom you can get a general feel for how large or small an effect one categorical variable has on another. Here you see that job title and ethnicity are certainly related, but not directly linked. This might lead you to search for what other factors are influencing the job a person gets.
"""

Cramer's V: 0.11173756740290954 
Degrees of freedom: 4
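
As a sanity check on the formula, Cramer's V can be written as a helper that takes only the table of counts, so that the `min(rows - 1, columns - 1)` term is read from the count table itself rather than from a DataFrame that also holds label columns. This is a minimal sketch: the function name `cramers_v`, the choice of `correction=False`, and the two toy tables are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramer's V for a 2-D table of counts (rows and columns are categories)."""
    table = np.asarray(table)
    # correction=False switches off Yates' correction, which SciPy would
    # otherwise apply to 2x2 tables only
    chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape[0] - 1, table.shape[1] - 1)
    return np.sqrt(chi2 / (n * k))

# Toy tables, for illustration only
perfect_association = [[10, 0, 0],
                       [0, 10, 0],
                       [0, 0, 10]]
no_association = [[10, 20],
                  [20, 40]]

v_perfect = cramers_v(perfect_association)
v_none = cramers_v(no_association)
print(v_perfect, v_none)
```

The two extremes bracket the scale: V is 1 when each row category maps to exactly one column category and 0 when the rows have identical proportions.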


In [18]:
police_salaries_df = pd.read_csv(path_data+'police_salaries_df.csv')
police_salaries_df.head()

Unnamed: 0,Title,Gender,Ethnicity,Annual Salary,Years of Employment
0,Police Officer,M,White,72681.44,6
1,Police Officer,M,White,83210.4,15
2,Police Officer,M,White,83210.4,13
3,Police Officer,M,White,72681.44,5
4,Police Officer,M,Hispanic or Latino,95270.24,20


In [21]:
# exercise 04

"""
Multiple comparisons problem

The multiple comparisons problem arises when a researcher repeatedly checks different variables/samples against one another for significance. Just by random chance we expect to find an occasional result of statistical significance.

In this exercise you'll work with data from salaries for employees at the City of Austin, TX. You will compare their salaries against randomly generated data. You will see how often this random data is "significant" in explaining the salaries of employees. Clearly any such "significance" would be spurious, as random numbers aren't very helpful in explaining anything!

A DataFrame of police officers salaries (police_salaries_df) has been loaded for you, as have the packages pandas as pd, NumPy as np, Matplotlib as plt, and stats from SciPy.
"""

# Instructions

"""

    Store the number of people in the dataset in n_rows (each row is a person), and initialize the number of significant results, n_significant, to zero.
    Write a for loop which runs 1000 times and generates n_rows random numbers.
    Compute Pearson's R and the associated p-value between these randomly generated numbers and the police officer salaries.
    If the p-value is significant at 5%, add one to n_significant using the += operator.

"""
p_values = []
# solution

# Compute number of rows and initialize n_significant
n_rows = police_salaries_df.shape[0]
n_significant = 0

# For loop which generates n_rows random numbers 1000 times
for i in range(1000):
    random_nums = np.random.uniform(size=n_rows)
    # Compute correlation between random_nums and police salaries
    r, p_value = stats.pearsonr(police_salaries_df['Annual Salary'], random_nums)
    # Store the p-value for the next exercise
    p_values.append(p_value)
    # If the p-value is significant at 5%, increment n_significant
    if p_value < 0.05:
        n_significant += 1

print(n_significant)

#----------------------------------#

# Conclusion

"""
Notice how about 50 out of 1000, or 5%, of the results were significant. This is not a coincidence! Your cutoff for significance was 5%, meaning about 5% of the time a random correlation will cross this threshold. This clearly demonstrates the problem with repeatedly running experiments and assuming that low p-values indicate something meaningful.
"""

46
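
The count of roughly 50 can be anticipated without any salary data. When the null hypothesis is true and the test statistic is continuous, p-values are uniformly distributed on [0, 1], so about `alpha * n_tests` of them fall below `alpha`. The sketch below is illustrative only: it draws uniform p-values directly with a fixed seed, and the family-wise error rate formula assumes the tests are independent.

```python
import numpy as np

alpha = 0.05
n_tests = 1000

# Expected number of spurious "significant" results when every null is true
expected_false_positives = alpha * n_tests

# Probability of at least one false positive, assuming independent tests
family_wise_error_rate = 1 - (1 - alpha) ** n_tests

# Simulate p-values under the null: uniform on [0, 1]
rng = np.random.default_rng(42)
null_p_values = rng.uniform(size=n_tests)
n_significant_sim = int(np.sum(null_p_values < alpha))

print(expected_false_positives, family_wise_error_rate, n_significant_sim)
```

With 1000 tests the chance of seeing at least one false positive is essentially 1, which is the motivation for correcting alpha.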


In [27]:
# exercise 05

"""
Bonferroni-Holm correction

You've seen that comparing many different datasets, even randomly generated ones, can result in "statistically significant relationships" that are anything but! One way around this is to apply a correction to your significance level, alpha. In this exercise you'll explore why you should apply this correction and how to do so.

The 1000 p-values you calculated in the previous exercise have been loaded for you in a NumPy array p_values, as has the package NumPy as np.
"""

# Instructions

"""

    Compute the Bonferroni-corrected value of alpha = 5%.
    Print out how many of the p-values were less than this corrected cutoff.
---
Question

What a drastic change from the last exercise! Previously you had quite a few spurious correlations being detected. Now you likely had none, or very close! This means that the bar for what is significant has been raised drastically. This helps reduce the chance of spurious correlations being marked as significant.

When you used the uncorrected alpha, about 50 of the 1000 experiments (or 5% of them) were "significant". When using the Bonferroni-Holm correction, none (or close to none) of them were significant. Which of the following options explains why this happened?
[The Bonferroni-Holm correction reduces the probability of rejecting the null in proportion to the number of experiments performed.]

---

Question

Which of the following explains when you should use the Bonferroni-Holm correction when making inferences?
[Any time you wish to make many different comparisons to perform inference.]

"""

# solution

# Compute the Bonferroni-corrected alpha (alpha divided by the number of tests)
bonf_alpha = 0.05 / 1000
p_values = np.array(p_values)
# Check how many p-values were significant at this level
print(np.sum(p_values < bonf_alpha))

#----------------------------------#

# Conclusion

"""
Beyond knowing about statistical tools, it's important to know exactly when to apply them. By understanding that the role of the Bonferroni correction is to minimize the chance of spurious correlations being deemed statistically significant, you know exactly when and where to use it!
"""

0
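
The solution above divides alpha by the number of tests, which is the plain Bonferroni cutoff. Holm's step-down variant sorts the p-values and compares the k-th smallest with `alpha / (m - k + 1)`, stopping at the first one that fails. The sketch below contrasts the two on five made-up p-values; the function name `holm_reject` and the example values are hypothetical.

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by Holm's step-down procedure."""
    p_values = np.asarray(p_values, dtype=float)
    m = len(p_values)
    order = np.argsort(p_values)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # rank is 0-based, so the threshold is alpha / (m - rank)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first failure; larger p-values are not rejected
    return reject

# Made-up p-values, for illustration only
example_p = np.array([0.001, 0.012, 0.02, 0.04, 0.3])

bonferroni_mask = example_p < 0.05 / len(example_p)
holm_mask = holm_reject(example_p)

print(bonferroni_mask.sum(), holm_mask.sum())  # prints: 1 2
```

Holm never rejects fewer hypotheses than Bonferroni at the same alpha, because its first threshold equals the Bonferroni cutoff and later thresholds are looser.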


### What is power anyway?

Some of the wording around power can be quite confusing. For example, the power of a two-sample t-test is:

*The probability of rejecting the null in favor of the alternative, assuming there is indeed a difference in population means between the two groups.*

That's quite a mouthful!

Imagine you are analyzing a marketing campaign and want to infer if it increases customer spending. You conduct a two-sample t-test (customers exposed to the campaign versus customers not exposed) to compare the mean dollar amounts spent for each group.

![Categories](/home/nero/Documents/Estudos/DataCamp/Python/courses/Foundations_of_Inference_in_Python/ch03.png)


**Being able to translate between statistical language and concrete examples is an incredibly important skill. Notice how sample and population means can be re-expressed as groups of customers and all possible customers, tests can be re-expressed as looking for significant differences in samples, and power can be expressed as the chance of detection.**
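
Power as "the chance of detection" can be made concrete by simulation: generate many pairs of samples in which a difference really exists, run the t-test on each pair, and count how often the null is rejected. This is a minimal sketch under stated assumptions: both groups are normal with unit variance, the true standardized difference is 0.5, and the function name `simulated_power` and all parameter values are illustrative.

```python
import numpy as np
from scipy import stats

def simulated_power(effect_size, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Estimate the power of a two-sample t-test by Monte Carlo simulation."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # The groups truly differ by effect_size standard deviations
        control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        treated = rng.normal(loc=effect_size, scale=1.0, size=n_per_group)
        _, p_value = stats.ttest_ind(treated, control)
        if p_value < alpha:
            rejections += 1
    # Share of simulated experiments in which the real difference was detected
    return rejections / n_sims

power_small_sample = simulated_power(effect_size=0.5, n_per_group=20)
power_large_sample = simulated_power(effect_size=0.5, n_per_group=64)
print(power_small_sample, power_large_sample)
```

The same true difference is detected far more often with the larger sample, which is the sense in which sample size buys power.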

### Power for experimental design

Imagine that you collected a sample of 100 people for your study and spent time and money executing it. After your study was done, you realized that the power of your test was only 10%. In other words, even if there was a difference between your groups, there was only a 10% chance your test would detect it, given the data you supplied. What a waste of effort!

Therefore, the best practice is to estimate power before collecting data and running an experiment. In this exercise you'll organize the steps followed in this process. The items below are steps in the process of experimental design using power analysis.

![Solution](/home/nero/Documents/Estudos/DataCamp/Python/courses/Foundations_of_Inference_in_Python/ch03_02.png)

**Notice how following this process lays out a blueprint for conducting your analysis. By following a principled approach you are able to collect a sample that is likely to help you test your hypothesis.**

In [45]:
games = investments_df[investments_df['market'] == 'Games']
ads = investments_df[investments_df['market'] == 'Advertising']


# Calculate the standard deviation of each market and the number of companies in each market
games_sd = games['funding_total_usd'].std()
ads_sd = ads['funding_total_usd'].std()
games_n = games.shape[0]
ads_n = ads.shape[0]

# Calculate the pooled standard deviation between the two markets
pooled_sd = np.sqrt(((games_n - 1) * games_sd ** 2 + (ads_n - 1) * ads_sd ** 2) / (games_n + ads_n - 2))

# Calculate Cohen's d
d = (games['funding_total_usd'].mean() - ads['funding_total_usd'].mean()) / pooled_sd

display(d)

-0.10123429934046722

In [46]:
# exercise 06

"""
Computing power and sample sizes

You want to compare average funding per investment round for games companies versus advertising companies. You begin by computing the power of a two sample t-test.

Three things have been loaded for you:

    ads_acquired_avg_funding - (pandas Series) The average funding of each acquired advertising company
    games_acquired_avg_funding - (pandas Series) Same, but for game companies
    ads_games_cohensd - (Number) Cohen's d (standardized effect size) for advertising versus games companies

The function needed to compute power has been imported from statsmodels.stats.power.
"""

# Instructions

"""

    Compute the ratio of games companies (games_n) to advertising companies (ads_n).
    Compute power with an effect size ads_games_cohensd, alpha of 5%, number of advertising companies ads_n as nobs1, and the games_ads_ratio above.
---

    Solve for the sample size nobs1 needed to achieve a power of 80% with an alpha of 5% and an effect size of ads_games_cohensd by setting nobs1 equal to None in solve_power().
    Print nobs1, the number of observations in one group (e.g. ads).

"""

# solution

from statsmodels.stats.power import TTestIndPower

# Compute the ratio of games to advertising companies
games_ads_ratio = games_n / ads_n

# Compute the power of the test with the observed group sizes
power = TTestIndPower().power(effect_size=d,
                              nobs1=ads_n,
                              alpha=0.05,
                              ratio=games_ads_ratio)

# Solve for the sample size needed to achieve a power of 80%
nobs1 = TTestIndPower().solve_power(effect_size=d, nobs1=None, alpha=0.05, power=0.8)

# Print the number of participants needed in one group
print(nobs1)

#----------------------------------#

# Conclusion

"""
Power can be a tricky topic, great job mastering it! You've been able to use solve_power() to identify that, for the effect size computed in this notebook, you need roughly 1,500 participants in each group! While many of the techniques you've learned so far focus on simply running a test and hoping for the best, power analysis allows you to understand just how likely your test is to be successful.
"""

1532.6875406057334
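
Because `solve_power()` returns a fractional sample size, a common follow-up is to round up and confirm that the rounded design still reaches the target power. The sketch below does this for a hypothetical effect size of 0.5 with equal group sizes; the value 0.5 and the variable names are assumptions, not results from the data above.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.5  # hypothetical standardized effect, for illustration only

# Solve for the per-group sample size that gives 80% power at alpha = 5%
nobs1 = float(analysis.solve_power(effect_size=effect_size, nobs1=None,
                                   alpha=0.05, power=0.8, ratio=1.0))

# You cannot recruit a fraction of a participant, so round up
n_per_group = math.ceil(nobs1)

# Check the power actually achieved with the rounded sample size
achieved_power = analysis.power(effect_size=effect_size, nobs1=n_per_group,
                                alpha=0.05, ratio=1.0)

print(nobs1, n_per_group, achieved_power)
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target.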

