## What is Data Science?

- This video gives a good summary of data science: https://www.youtube.com/watch?v=KdgQvgE3ji4&t=217s

<img src="Images/DS_process.png" width="600" height="600">

## Data Science Road Map at Make School

- https://docs.google.com/document/d/1dtMNJRDto5cWPLJv0J4eGxv_rv__20ZKsd9RaXU0ioM/edit

## What is Exploratory Data Analysis (EDA)?

- EDA is the **critical process of performing initial investigations on data**. It lets us discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

- EDA has three components:

    - **Data analysis**. Example: given the telecom churn dataset, we want to know the percentage of loyal and non-loyal customers, which state has the most churned customers, and the maximum length of international calls among loyal users (Churn == 0) who do not have an international plan.
    
    - **Data analysis and visualization**. Comparisons are normally shown by plotting the results, which is easier for humans to read than raw numbers.
    
    - **Statistical analysis on the data**. Example: given the Titanic dataset, what is the age distribution of male passengers for a specific "Embark" value (were they mainly young, middle-aged, or old)?
    
<img src="Images/DA_process.png" width="600" height="600">
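
The telecom churn questions above can be sketched with pandas. This is a minimal sketch, not the course's solution: the rows below are made up, and the column names (`State`, `International plan`, `Total intl minutes`, `Churn`) are assumptions about how such a dataset might be laid out.

```python
import pandas as pd

# Hypothetical toy data standing in for the telecom churn dataset;
# the column names and values are assumptions for illustration only.
df = pd.DataFrame({
    "State": ["CA", "CA", "NY", "TX", "TX", "TX"],
    "International plan": ["No", "Yes", "No", "No", "No", "Yes"],
    "Total intl minutes": [10.0, 13.7, 12.2, 6.6, 18.9, 9.1],
    "Churn": [0, 1, 0, 1, 1, 0],
})

# Share of loyal (Churn == 0) and non-loyal (Churn == 1) customers, in percent
churn_pct = df["Churn"].value_counts(normalize=True) * 100

# State with the most churned customers
top_churn_state = df[df["Churn"] == 1]["State"].value_counts().idxmax()

# Longest international usage among loyal customers without an international plan
loyal_no_plan = df[(df["Churn"] == 0) & (df["International plan"] == "No")]
max_intl_minutes = loyal_no_plan["Total intl minutes"].max()

print(churn_pct)
print(top_churn_state)
print(max_intl_minutes)
```

Each question maps to one or two pandas calls: `value_counts` for proportions, a boolean mask for filtering, and `max` for the summary statistic.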

## Software and Tools For EDA

- For now, you’ll only need [Anaconda](https://www.anaconda.com/distribution/) (built with Python 3.6)

- After Anaconda, we mainly use:

    - [pandas](https://pandas.pydata.org/), which is a Python library that provides extensive means for data analysis
    
    - [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/), for data visualization
    
    - [scipy](https://www.scipy.org/) and [statsmodels](https://www.statsmodels.org/stable/index.html) for statistical data analysis
    
<img src="Images/DA_tools.png" width="600" height="600">

## What is Jupyter Notebook?

<img src="Images/what_is_jupyter_notebook.png" width="600" height="600">

## Which data career makes more money in London?

This is an example of Statistical Data Analysis

We'll review [this tutorial](https://dashee87.github.io/data%20science/r/Engineering-Data-Engineers/) that shows you how to scrape the Indeed API for all junior/senior positions for Data Engineer, Data Analyst and Data Scientist.

<img src="Images/DE_DA_DS.png" width="500" height="500">

Later in this course, you'll learn how to collect, analyze, and visualize data as they do in this tutorial. You will also learn how to represent the salaries with a cumulative distribution function (CDF).
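
As a preview, an empirical CDF can be computed by sorting the values and assigning each one the fraction of observations at or below it. The sketch below assumes NumPy is available (it is installed alongside pandas), and the salary figures are invented for illustration, not taken from the tutorial.

```python
import numpy as np

# Hypothetical salaries (not real Indeed data)
salaries = np.array([30000, 45000, 38000, 60000, 52000])

# Sort the values; the i-th smallest value has CDF equal to i / n
sorted_salaries = np.sort(salaries)
cdf = np.arange(1, len(sorted_salaries) + 1) / len(sorted_salaries)

for salary, prob in zip(sorted_salaries, cdf):
    print(f"P(salary <= {salary}) = {prob:.1f}")

# Fraction of salaries at or below a chosen threshold
share_up_to_45k = (salaries <= 45000).mean()
print(share_up_to_45k)
```

Plotting `sorted_salaries` against `cdf` (for example with matplotlib) gives the step-shaped curve usually shown for an empirical CDF.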



## Resources we can use to pull/access data

- Databases, for example by writing SQL queries

- CSV files

- Calling APIs, such as the previously shown Indeed API

- Crawling (scraping) websites
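
Two of these sources, CSV files and SQL databases, can be loaded into pandas with very little code. The sketch below is illustrative only: it uses an in-memory CSV string and an in-memory SQLite database so that it runs without any external files, and the table, column names, and values are made up.

```python
import io
import sqlite3

import pandas as pd

# CSV: read from an in-memory string instead of a file on disk
csv_text = "name,role\nAda,Data Scientist\nGrace,Data Engineer\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# SQL: build a throwaway SQLite database and query it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (title TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("Data Analyst", 35000), ("Data Scientist", 50000)],
)
df_sql = pd.read_sql_query(
    "SELECT title, salary FROM jobs WHERE salary > 40000", conn
)
conn.close()

print(df_csv)
print(df_sql)
```

In both cases the result is a DataFrame, so the rest of the analysis looks the same regardless of where the data came from.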

# Setting Up Your Environment

For this class you will need to install the following Python packages: 
- pip3 install pandas
- pip3 install seaborn
- pip3 install matplotlib
- pip3 install statsmodels
- pip3 install scipy
- pip3 install scikit-learn==0.21.3
- pip3 install notebook
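
One way to check which of these packages Python can find is to look them up by import name. This is a minimal sketch using the standard library's `importlib`; it only reports whether each package is discoverable and does not check versions.

```python
import importlib.util

# Import names for the packages listed above; note that scikit-learn
# is imported as "sklearn".
packages = [
    "pandas", "seaborn", "matplotlib", "statsmodels",
    "scipy", "sklearn", "notebook",
]

# Map each package to True if Python can locate it, False otherwise
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```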

All of the material in this course will be presented using Jupyter notebooks. To follow along with the notebooks, choose one of three options:
1. Copy the notebook files (.ipynb files) from the course repo
1. Clone the course repo
1. Use [nbviewer and Binder](https://www.tutorialspoint.com/jupyter/sharing_jupyter_notebook_using_github_and_nbviewer.htm)

To run Jupyter notebooks locally, navigate to your notebook directory in the Terminal and type: jupyter notebook Data_Analysis_Intro.ipynb 

# Jupyter Basics

- Jupyter notebooks consist of components called cells
- Cells can be configured as Markdown or runnable Python code
- You can use the GUI or one of several [keyboard shortcuts](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330)
- Shift + Enter will run a cell while Enter will just go to the next line

## Activity: How many female passengers survived the Titanic?

- Dataset Description: https://www.kaggle.com/andyyang/titanic-survival-project

- The intention of this activity is to show that by using appropriate data science packages, we can accelerate the analysis and modeling of our data

In [1]:
# without using pandas
import csv

with open('titanic.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # set the counter
    c = 0
    # iterate over rows 
    for row in csv_reader:
        # if passenger is female and if she survived, increase the counter
        if row[1] == '1' and row[4] == 'female':
            c += 1
print(c)

233


In [2]:
# Using pandas

import pandas as pd

df = pd.read_csv('titanic.csv')

len(df[(df['Sex'] == 'female') & (df['Survived'] == 1)])

233

# Pandas Dataframe

The pandas DataFrame, often abbreviated as df, is a 2-D labeled data structure with columns of potentially different types. It lets us easily select and manipulate data.

You can think of a dataframe as a programmatic Excel spreadsheet. [Here](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96) is a good intro to pandas dataframes.
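
To make the description concrete, here is a small sketch of a DataFrame with columns of different types, plus a few ways to select and manipulate its data. The rows are hypothetical and only loosely shaped like the Titanic data, and the `IsAdult` column is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; each column holds a different type of data
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Sex": ["female", "male", "female"],
    "Age": [29.0, 41.0, 35.0],
    "Survived": [1, 0, 1],
})

# Each column has its own dtype (text, float, integer)
print(df.dtypes)

# Select a single column by its label
ages = df["Age"]

# Select rows with a boolean condition
females = df[df["Sex"] == "female"]

# Add a derived column computed from an existing one
df["IsAdult"] = df["Age"] >= 18

print(females)
print(df)
```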

## Activity: Live Demo

Let's use the cell above to explore some fundamental dataframe methods and properties.
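
If you want a starting point for the demo, the sketch below lists a few commonly used DataFrame properties and methods. It uses a tiny made-up frame so that it runs on its own; the same calls work on the DataFrame loaded from titanic.csv.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic DataFrame
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Age": [22.0, 35.0, 58.0, 4.0],
    "Survived": [1, 0, 1, 1],
})

print(df.shape)                       # (number of rows, number of columns)
print(df.columns.tolist())            # column labels
print(df.head(2))                     # first two rows
print(df.describe())                  # summary statistics for numeric columns
print(df["Sex"].value_counts())       # how often each value occurs
print(df.sort_values("Age").head(1))  # row with the youngest passenger
```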

## Applications of Data Analysis

Here are some example applications of Data Analysis in the real world:
    
1. Is the new product or website better than the previous product or website?

1. If we have a crime dataset for a given city, which region of the city is the most violent? At what time of day? In which season? What type of crime happens the most?
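
The crime questions can be sketched as simple counts over a table of incidents. This is a rough sketch under stated assumptions: the data is invented, the column names (`region`, `hour`, `type`) are hypothetical, and "most violent" is approximated here by the number of recorded incidents.

```python
import pandas as pd

# Hypothetical incident records for one city
crimes = pd.DataFrame({
    "region": ["North", "South", "North", "East", "North", "South"],
    "hour": [23, 14, 22, 2, 23, 23],
    "type": ["theft", "assault", "theft", "burglary", "assault", "theft"],
})

# Region with the most recorded incidents
busiest_region = crimes["region"].value_counts().idxmax()

# Hour of the day with the most recorded incidents
peak_hour = crimes["hour"].value_counts().idxmax()

# Most frequently recorded type of crime
most_common_type = crimes["type"].value_counts().idxmax()

print(busiest_region, peak_hour, most_common_type)
```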


### Slack Activity

Go to the DS 1.1 Slack channel and list an application of Data Analysis in the real world that you would like to see, or would like to create as part of this course.


## What do you want to get out of this course?

What are two topics/items/things you want to learn from this course? Submit them through [this form](https://forms.gle/qDobsKhuPiga9MT7A), and we will do our best to make sure those topics/items are covered!

## Optional Reading 

In [3]:
# Another way to count the female passengers who survived the Titanic without using pandas

import csv

with open('titanic.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    gender_ls = []
    survive_ls = []
    for row in csv_reader:
        gender_ls.append(row[4])
        survive_ls.append(row[1])
        
gender_ls = gender_ls[1:]  
survive_ls = survive_ls[1:]

female_index = [i for i, j in enumerate(gender_ls) if j == 'female']
# print(female_index)
female_survived_not_survived = [survive_ls[i] for i in female_index]

num_female_survived = len([i for i, j in enumerate(female_survived_not_survived) if j == '1'])
print(num_female_survived)

233
