# Data Analysis in Python
## Statistics in Python



### Load packages

To begin your analysis, import the main Python packages for data analysis.
Import `numpy` for numerical operations and `pandas` for handling data frames, giving them the aliases `np` and `pd`, respectively.
From the `scipy` package, import the `stats` module for statistical functions.
Use the `from`, `import` and `as` keywords.

Here is some [support material](https://openclassrooms.com/en/courses/6900846-set-up-a-python-environment/6989471-import-python-packages-and-modules) to help with the task.
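
As an illustration of the three keywords, here is a minimal sketch that uses standard-library modules instead of the packages requested above, so the exercise itself is left to you. The aliases `m` and `st` are arbitrary choices made for this example.

```python
# Illustration with standard-library modules (not the exercise packages)
import math as m                  # "import ... as ..." gives the module a short alias
import statistics as st           # st is an arbitrary alias chosen for this example
from collections import Counter   # "from ... import ..." pulls one name out of a module

print(m.sqrt(16))                 # call a function through the alias
print(st.mean([1, 2, 3, 4]))      # same idea with another module
print(Counter("aab"))             # Counter is used directly, without a prefix
```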

---


In [None]:
import __ as __ 
import __ as __
from __ import __

**Optional:** Set random seed, for reproducibility

Setting a random seed ensures that your results are reproducible. Use the `seed()` function from the `numpy.random` module.  
More information: [numpy.random.seed documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html)
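
To see what reproducibility means in practice, the sketch below draws three random numbers twice, resetting the seed before each draw. The seed value 123 is an arbitrary choice for this example.

```python
import numpy as np

np.random.seed(123)                 # 123 is an arbitrary seed chosen for this example
first = np.random.normal(size=3)

np.random.seed(123)                 # resetting the same seed restarts the sequence
second = np.random.normal(size=3)

print(first)
print(second)
print(np.array_equal(first, second))  # checks that both draws are identical
```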

---

In [None]:
np.random.seed(__) #choose a number

### Make a mock data frame

Use a dictionary data structure and fill it with data!\
The dictionary has 3 keys: *"group"*, *"score1"* and *"score2"*.
+ In *"group"*, set the two group names, each repeated 30 times.
+ With `np.concatenate`, join the two arrays that follow.
+ Comment on what each parameter of `np.random.normal` means.
+ Follow the example in *"score1"* to set the values in *"score2"*.
+ Use `pd.DataFrame` to organize your data, which includes one categorical variable and two numeric variables.

More info? See:

- [Using numpy.concatenate()](https://www.geeksforgeeks.org/machine-learning/numpy-concatenate-function-python/)
- [Using numpy.random.normal()](https://www.datacamp.com/doc/numpy/random-normal)  
- [Tips on building data frames with `pandas`](https://www.w3schools.com/python/pandas/pandas_dataframes.asp)
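
The same building blocks can be tried on a smaller, unrelated example first. This is only a sketch: the column names `site` and `height`, the seed, and all the numbers are hypothetical and are not the values expected in the exercise.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed so the example is repeatable

# Hypothetical example: plant heights measured at two sites, 5 plants each
toy = {
    'site': ['north'] * 5 + ['south'] * 5,          # list repetition, then joining
    'height': np.concatenate([
        np.random.normal(loc=20, scale=2, size=5),  # mean 20, sd 2, 5 draws
        np.random.normal(loc=25, scale=2, size=5),  # mean 25, sd 2, 5 draws
    ]),
}
toy_df = pd.DataFrame(toy)  # each dictionary key becomes a column
print(toy_df.shape)
print(toy_df['site'].value_counts())
```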

---

In [None]:
data = {
    'group': ['_'] * 30 + ['_'] * 30, # replace each underscore with a category name, e.g. "A" and "B"
    'score1': np.concatenate([
        np.random.normal(loc = 50, scale = 10, size = 30), # parameters for mean, standard deviation and sample size
        np.random.normal(__, __, __) 
    ]),
    'score2': np._____([
        np._______(__,__,__),
        np._______(__,__,__)
    ])
}
df = pd.___(data) #turn the dictionary into a DataFrame
df.head() #method to show the top rows of a DataFrame

### Save data table

Save your DataFrame to a CSV file using the `to_csv()` method from the `pd.DataFrame` class.  
See: [pandas.DataFrame.to_csv documentation](https://www.geeksforgeeks.org/pandas/saving-a-pandas-dataframe-as-a-csv/)
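
One way to check that a file was written as intended is to read it back with `pd.read_csv()`. The sketch below does this in a temporary directory. The data frame and the file name `example.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

small = pd.DataFrame({'group': ['A', 'B'], 'score': [1.5, 2.5]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.csv')   # hypothetical file name
    small.to_csv(path, index=False)           # index=False leaves out the row labels
    restored = pd.read_csv(path)              # read the file back to check it

print(list(restored.columns))  # only the data columns, no extra index column
print(restored)
```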

---

In [None]:
df.to_csv("name_your_file.csv", index=False) # index=False leaves the row labels (row names) out of the file

### Test Assumptions

Test statistical assumptions such as normality and equality of variances. Use the `stats.shapiro()` function for normality and `stats.levene()` for variance equality, both from the `scipy.stats` module.

But first, here is how to select part of a data frame:

```
df['group'] # select a column by its name
df['group'] == 'A' # a logical operation returns a boolean Series...
df[df['group'] == 'A'] # ...which selects the matching rows of the data frame
df[df['group'] == 'A']['score1'] # then select a single column from those rows
```

More info:

- [Indexing Dataframes](https://www.dataquest.io/blog/tutorial-indexing-dataframes-in-pandas/)
- [Performing Shapiro-Wilk Test](https://www.statology.org/shapiro-wilk-test-python/)  
- [Performing Levene's Test](https://www.statology.org/levenes-test-python/)
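
Both tests return a result with a `.pvalue`. The null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution, and the null hypothesis of Levene's test is that the groups have equal variances, so a small p-value is evidence against the assumption. The sketch below uses two hypothetical samples, and the 0.05 threshold is a common convention rather than a fixed rule.

```python
import numpy as np
from scipy import stats

np.random.seed(1)  # arbitrary seed
a = np.random.normal(loc=0, scale=1, size=40)   # hypothetical sample
b = np.random.normal(loc=0, scale=1, size=40)   # hypothetical sample

shapiro_a = stats.shapiro(a)      # normality check for one sample
levene_ab = stats.levene(a, b)    # variance equality check for two samples

alpha = 0.05  # conventional threshold, an assumption of this sketch
print("Shapiro p-value:", shapiro_a.pvalue)
print("Levene p-value:", levene_ab.pvalue)
print("Normality rejected for a:", shapiro_a.pvalue < alpha)
print("Equal variances rejected:", levene_ab.pvalue < alpha)
```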

---

In [None]:
# Select the score1 values of each group
df_group1 = df[df['group'] == '_']['score1'] # replace the underscore with one group name
df_group2 = df[df['group'] == '_']['score1'] # replace the underscore with the other group name

# Check normality for each group
stats.shapiro(df_group1)
stats.shapiro(_________) #complete the code

# Check variance equality
stats.levene(________________, _________________) # pass the scores of both groups

### Using the t-test, Welch's test or U test

Some functions, like the one used in the following cell, return several values bundled in a single result object. You can store just the value you need in a variable, either by *indexing* (e.g. `[0]`) or by *attribute name* (e.g. `.pvalue`).


In [None]:
print(stats.ttest_ind(df_group1, df_group2)) # shows the whole result, including "statistic" and "pvalue"
print(stats.ttest_ind(df_group1, df_group2)[0]) # returns the first element of the result (the statistic)
t_p = stats.ttest_ind(df_group1, df_group2).pvalue # returns the element named "pvalue"
print("T-test p-value:", t_p)

Now compare the groups using statistical tests. Use `stats.ttest_ind()` for t-tests, setting its `equal_var` parameter to get Welch's test, and `stats.mannwhitneyu()` for a non-parametric comparison, all from `scipy.stats`. Use `print` to show only the `.pvalue` of each result.

More info:  
- [Significance Tests](https://www.w3schools.com/python/scipy/scipy_statistical_significance_tests.php)
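
The assumption checks from the previous section can guide which test to run. The sketch below shows one common rule of thumb, not the only valid workflow. The samples, the seed and the 0.05 threshold are hypothetical choices.

```python
import numpy as np
from scipy import stats

np.random.seed(2)  # arbitrary seed
x = np.random.normal(loc=10, scale=1, size=30)  # hypothetical sample
y = np.random.normal(loc=11, scale=3, size=30)  # hypothetical sample with a larger spread

alpha = 0.05  # conventional threshold, an assumption of this sketch
normal = bool(stats.shapiro(x).pvalue > alpha and stats.shapiro(y).pvalue > alpha)
equal_variances = bool(stats.levene(x, y).pvalue > alpha)

if normal:
    # equal_var switches between Student's (True) and Welch's (False) t-test
    p = stats.ttest_ind(x, y, equal_var=equal_variances).pvalue
else:
    # without normality, fall back to the non-parametric Mann-Whitney U test
    p = stats.mannwhitneyu(x, y).pvalue

print("Normal:", normal, "| Equal variances:", equal_variances)
print("Chosen test p-value:", p)
```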

---

In [None]:
# Welch's t-test (does not assume equal variances)
t_p_welch = stats.ttest_ind(___, equal_var=___).pvalue # tip: equal_var takes a boolean value
print("Welch's t-test p-value:", ___)

# Mann-Whitney U test (non-parametric)
u_p = stats.mannwhitneyu(____)___
print("Mann-Whitney U test p-value:", __)

### Linear Regression

Fit a linear regression model to your data using the `linregress()` function from the `scipy.stats` module. This function will provide regression statistics such as slope and intercept.  

But first, let's learn another coding trick:

You learned that some functions return more than one value. So, if you want to save each value in its own variable, do you need to run the same function over and over again?
No, because you can assign multiple variables at once! You just need to list the variables in the same order as the values the function returns.

```
# Instead of
t_stat = stats.ttest_ind(df_group1, df_group2)[0]
t_p = stats.ttest_ind(df_group1, df_group2)[1]

# You can
t_stat, t_p = stats.ttest_ind(df_group1, df_group2)
```
With `linregress()`, unpacking yields five values in this order: slope, intercept, rvalue, pvalue, stderr. The sixth statistic, intercept_stderr, is only available as an attribute of the returned object (e.g. `result.intercept_stderr`).


More info:

+ [Linear regression in Python](https://www.w3schools.com/python/python_ml_linear_regression.asp)
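
The two ways of reading the output of `linregress()` can be compared on a tiny data set. This is a sketch with hypothetical numbers that lie close to the line y = 2x + 1. The name `result` is an arbitrary choice.

```python
from scipy import stats

x = [0, 1, 2, 3, 4]                 # hypothetical predictor values
y = [1.1, 2.9, 5.2, 6.8, 9.0]       # hypothetical responses, roughly 2x + 1

result = stats.linregress(x, y)      # keep the whole result object
slope, intercept, r, p, se = result  # unpacking yields five values

print("Slope:", result.slope)                  # same value as the unpacked slope
print("Intercept:", result.intercept)
print("Intercept standard error:", result.intercept_stderr)  # attribute access only
```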

---

In [None]:
# Simple regression: score2 ~ score1
X = df['score1']
y = df['score2']

_____ = stats.linregress(X, y) # assign the result; intercept_stderr is available only as an attribute, not by unpacking
print("Intercept:", __)
print("Slope:", __)
print("Pearson's R:", __)
print("p-value:", __)
print("Standard Error:", __)
print("Intercept Standard Error:", __)
