# Index:

- [Introduction to pandas](#Introduction-to-pandas)
    - [Python native list](#Python-native-list)
    - [Numpy array](#Numpy-array)
    - [Pandas DataFrame](#Pandas-DataFrame)
    - [Standard Alias](#Standard-Alias) 
- [Creating a DataFrame](#Creating-a-DataFrame)
    - [Create a DataFrame from a dictionary](#Create-a-DataFrame-drom-a-dictionary)
    - [Change the index](#Change-the-index)
    - [Load data from an external source](#Load-data-from-an-external-source)
        - [Load a DataFrame from a file](#Load-a-DataFrame-from-a-file)
        - [Use fullname as the index](#Use-fullname-as-the-index)
        - [Load a DataFrame from online data](#Load-a-DataFrame-from-online-data)
        - [DataFrame summaries](#DataFrame-summaries)
- [Overview: The 'Black Box' Metaphor For Machine Learning](#Overview:-The-'Black-Box'-Metaphor-For-Machine-Learning)
- [Overview: Introduction to Statistical Analysis](#Overview:-Introduction-to-Statistical-Analysis)
- [Mini-Lesson 1.1: Calculating Expected Value and Variance](#Mini-Lesson-1.1:-Calculating-Expected-Value-and-Variance)
    - [Expected value](#Expected-value)
        - [An example](#An-example)
    - [Expected variance](#Expected-variance)
- [Mini-Lesson 1.2: The Basics of Using pandas](#Mini-Lesson-1.2:-The-Basics-of-Using-pandas)
    - [Codio Activity 1.1: Python Coding](#Codio-Activity-1.1:-Python-Coding)
        - [Problem 1: Creating a DataFrame](#Problem-1:-Creating-a-DataFrame)
        - [Problem 2: Create a dataframe from url](#Problem-2:-Create-a-dataframe-from-url)
        - [Problem 3: Create a dataframe by importing data from a file](#Problem-3:-Create-a-dataframe-by-importing-data-from-an-file)
    - [Codio Activity 1.2: Exploring a DataFrame](#Codio-Activity-1.2:-Exploring-a-DataFrame)
        - [Problem 1: Describing the data](#Problem-1:-Describing-the-data)
        - [Problem 2: Getting the missing data information](#Problem-2:-Getting-the-missing-data-information)
        - [Problem 3: Examining top 10 rows](#Problem-3:-Examining-top-10-rows)
    - [Codio Activity 1.3: Selecting Data in Multiple Ways](#Codio-Activity-1.3:-Selecting-Data-in-Multiple-Ways)      
        - [Part one: loc[]](#Part-one:-loc[])
        - [Part two: iloc[]](#Part-two:-iloc[])
        - [Part three:Split](#Part-three:Split)
    - [Discussion 1.2: Data Repositories](#Discussion-1.2:-Data-Repositories)
- [Mini-Lesson 1.3: Analyzing Data Using pandas](#Mini-Lesson-1.3:-Analyzing-Data-Using-pandas)
    - [Division](#Division)
    - [Summation](#Summation)
    - [Percentage](#Percentage)
- [Mini-Lesson 1.4: What Are Histograms and Why Do They Matter?](#Mini-Lesson-1.4:-What-Are-Histograms-and-Why-Do-They-Matter?)
    - [Histograms](#Histograms)
    - [Data shapes](#Data-shapes)
    - [Skewed data](#Skewed-data)
    - [Outliers](#Outliers)
    - [Discussion 1.3: Creating Visualizations Using Personally-Sourced Data](#Discussion-1.3:-Creating-Visualizations-Using-Personally-Sourced-Data)
- [Glossary](#Glossary)

<h1><center>Introduction to pandas</center></h1>
Pandas adds functionality to Python for operating on tabular data. We have already seen a few things about tables:

- Tables are a common format for datasets in machine learning applications.
- Rows represent samples from a multivariable distribution.
- Columns represent features, or single random variables.

So, why do we need pandas? Python already offers other list-like containers, such as:
# Python native list
Python native lists are heterogeneous, meaning that a single list can hold different data types.

In [77]:
 [0.1, 0.2, 'a', 'b']

[0.1, 0.2, 'a', 'b']

# Numpy array
A NumPy array is homogeneous, meaning that it contains only one data type. Because homogeneity is efficient for numerical computation, arrays are preferred for ML applications.

In [78]:
import numpy as np

np.array([0.1, 0.2, 0.3, 0.4])

array([0.1, 0.2, 0.3, 0.4])
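
To make the contrast concrete, here is a minimal sketch (an illustration added to these notes, not part of the original cells). It shows that a native list keeps each element's own type, while NumPy coerces mixed input to one common dtype. The variable names are chosen for this example only.

```python
import numpy as np

# A native list keeps each element's own type.
mixed_list = [0.1, 0.2, 'a', 'b']
list_types = [type(item).__name__ for item in mixed_list]
print(list_types)  # ['float', 'float', 'str', 'str']

# A NumPy array holds a single dtype, so floats stay floats ...
float_array = np.array([0.1, 0.2, 0.3, 0.4])
print(float_array.dtype)  # float64

# ... and mixed input is coerced to one common type (here, strings).
coerced_array = np.array(mixed_list)
print(coerced_array.dtype.kind)  # 'U' stands for a Unicode string dtype
```

The coercion is the reason mixed data usually belongs in separate columns rather than in a single array.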

# Pandas DataFrame
So again, why do we need pandas?
Pandas takes this one step further and offers a DataFrame type, which is exactly what we need for tabular data. The DataFrame shown below has three columns, and each column is homogeneous in itself.
<img src="Images/11.png">
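
As a rough sketch of the same idea in code, the snippet below builds a small three-column DataFrame and prints the dtype of each column. The column names and values are made up for illustration and are not taken from the image.

```python
import pandas as pd

# Hypothetical three-column table; names and values are illustrative only.
df = pd.DataFrame({
    'height_m': [1.62, 1.80, 1.75],
    'age': [31, 45, 27],
    'name': ['Ana', 'Ben', 'Cleo'],
})

# Each column has one dtype of its own, but dtypes can differ across columns.
print(df.dtypes)
```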

# Standard Alias
Like many packages in Python, pandas has a standard alias, which is '**pd**'.

In [79]:
import pandas as pd

<h1><center>Creating a DataFrame</center></h1>

<a id="Create-a-DataFrame-drom-a-dictionary"></a>

# Create a DataFrame from a dictionary
You can create a DataFrame from data already stored in your computer's memory, or by importing data from an external source.

In [80]:
data = {'A':[25,56,93] , 'B':['str1','str2','str3']}

In [81]:
X = pd.DataFrame(data)
X

|   | A  | B    |
|---|----|------|
| 0 | 25 | str1 |
| 1 | 56 | str2 |
| 2 | 93 | str3 |


# Change the index
- Pass an index to the constructor with the **index** input argument,

In [82]:
X = pd.DataFrame(data, index=['row0','row1','row2'])
X

|      | A  | B    |
|------|----|------|
| row0 | 25 | str1 |
| row1 | 56 | str2 |
| row2 | 93 | str3 |


- or use set_index() to assign a column to be the index,

In [83]:
X = pd.DataFrame(data)
X = X.set_index('A')
X

| A  | B    |
|----|------|
| 25 | str1 |
| 56 | str2 |
| 93 | str3 |


- or keep the default index (list of integers)

In [84]:
X = pd.DataFrame(data)
X

|   | A  | B    |
|---|----|------|
| 0 | 25 | str1 |
| 1 | 56 | str2 |
| 2 | 93 | str3 |


# Load data from an external source
Pandas provides many read functions. In Jupyter, you can list them by typing **"pd.read_"** and pressing Tab, which activates autocomplete.
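
Outside of Jupyter, a similar list can be produced in code. The sketch below simply filters the names exposed by the pandas module. The exact list depends on the installed pandas version.

```python
import pandas as pd

# Collect every top-level pandas name that starts with "read_".
read_functions = sorted(name for name in dir(pd) if name.startswith('read_'))
print(read_functions)
```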

## Load a DataFrame from a file

In [86]:
filename= 'DataSets/celebrity-heights.csv'
CH = pd.read_csv(filename)
CH

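|      | id   | firstname | midname | lastname    | fullname          | ftin       | feet | inches | meters  | gender |
|------|------|-----------|---------|-------------|-------------------|------------|------|--------|---------|--------|
| 0    | 1    | Verne     | NaN     | Troyer      | Verne Troyer      | 2ft 8in    | 2    | 8.00   | 0.81280 | M      |
| 1    | 8    | Herve     | NaN     | Villechaize | Herve Villechaize | 3ft 10in   | 3    | 10.00  | 1.16840 | M      |
| 2    | 9    | David     | NaN     | Rappaport   | David Rappaport   | 3ft 11in   | 3    | 11.00  | 1.19380 | M      |
| 3    | 2    | Tony      | NaN     | Cox         | Tony Cox          | 3ft 6in    | 3    | 6.00   | 1.06680 | M      |
| 4    | 3    | Warwick   | NaN     | Davis       | Warwick Davis     | 3ft 6in    | 3    | 6.00   | 1.06680 | M      |
| ...  | ...  | ...       | ...     | ...         | ...               | ...        | ...  | ...    | ...     | ...    |
| 5497 | 5499 | General   | NaN     | Height      | General Height    | 7ft 6.5in  | 7    | 6.50   | 2.29870 | NaN    |
| 5498 | 5498 | Matthew   | NaN     | McGrory     | Matthew McGrory   | 7ft 6in    | 7    | 6.00   | 2.28600 | M      |
| 5499 | 5500 | Sandy     | NaN     | Allen       | Sandy Allen       | 7ft 7.25in | 7    | 7.25   | 2.31775 | F      |
| 5500 | 5501 | Sun       | Ming    | Ming        | Sun Ming Ming     | 7ft 8.75in | 7    | 8.75   | 2.35585 | NaN    |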


## Use fullname as the index

In [87]:
CH2 = CH.set_index('fullname')
CH2

| fullname          | id   | firstname | midname | lastname    | ftin       | feet | inches | meters  | gender |
|-------------------|------|-----------|---------|-------------|------------|------|--------|---------|--------|
| Verne Troyer      | 1    | Verne     | NaN     | Troyer      | 2ft 8in    | 2    | 8.00   | 0.81280 | M      |
| Herve Villechaize | 8    | Herve     | NaN     | Villechaize | 3ft 10in   | 3    | 10.00  | 1.16840 | M      |
| David Rappaport   | 9    | David     | NaN     | Rappaport   | 3ft 11in   | 3    | 11.00  | 1.19380 | M      |
| Tony Cox          | 2    | Tony      | NaN     | Cox         | 3ft 6in    | 3    | 6.00   | 1.06680 | M      |
| Warwick Davis     | 3    | Warwick   | NaN     | Davis       | 3ft 6in    | 3    | 6.00   | 1.06680 | M      |
| ...               | ...  | ...       | ...     | ...         | ...        | ...  | ...    | ...     | ...    |
| General Height    | 5499 | General   | NaN     | Height      | 7ft 6.5in  | 7    | 6.50   | 2.29870 | NaN    |
| Matthew McGrory   | 5498 | Matthew   | NaN     | McGrory     | 7ft 6in    | 7    | 6.00   | 2.28600 | M      |
| Sandy Allen       | 5500 | Sandy     | NaN     | Allen       | 7ft 7.25in | 7    | 7.25   | 2.31775 | F      |
| Sun Ming Ming     | 5501 | Sun       | Ming    | Ming        | 7ft 8.75in | 7    | 8.75   | 2.35585 | NaN    |


## Load a DataFrame from online data
- We will load data on educational attainment and personal income from the [California Open Data portal](https://data.ca.gov/dataset/ca-educational-attainment-personal-income/resource/26201f19-4469-4311-a819-bbbd3e557eda).
- Go to the website, copy the URL and enter it below.

In [88]:
url = 'https://data.ca.gov/dataset/cea8cd18-9d21-4676-85de-d504ee2d4aab/resource/26201f19-4469-4311-a819-bbbd3e557eda/download/ca-educational-attainment-personal-income-2008-2014.csv'
X = pd.read_csv(url)
X

|      | Year                   | Age       | Gender | Educational Attainment      | Personal Income      | Population Count |
|------|------------------------|-----------|--------|-----------------------------|----------------------|------------------|
| 0    | 01/01/2008 12:00:00 AM | 00 to 17  | Male   | Children under 15           | No Income            | NaN              |
| 1    | 01/01/2008 12:00:00 AM | 00 to 17  | Male   | No high school diploma      | No Income            | 650889.0         |
| 2    | 01/01/2008 12:00:00 AM | 00 to 17  | Male   | No high school diploma      | \$5,000 to \$9,999   | 30152.0          |
| 3    | 01/01/2008 12:00:00 AM | 00 to 17  | Male   | No high school diploma      | \$10,000 to \$14,999 | 7092.0           |
| 4    | 01/01/2008 12:00:00 AM | 00 to 17  | Male   | No high school diploma      | \$15,000 to \$24,999 | 3974.0           |
| ...  | ...                    | ...       | ...    | ...                         | ...                  | ...              |
| 1055 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$15,000 to \$24,999 | 82988.0          |
| 1056 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$25,000 to \$34,999 | 59607.0          |
| 1057 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$35,000 to \$49,999 | 113584.0         |
| 1058 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$50,000 to \$74,999 | 97657.0          |


## DataFrame summaries
- **info():** Column data types

In [89]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1060 entries, 0 to 1059
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Year                    1060 non-null   object 
 1   Age                     1060 non-null   object 
 2   Gender                  1060 non-null   object 
 3   Educational Attainment  1060 non-null   object 
 4   Personal Income         1060 non-null   object 
 5   Population Count        1026 non-null   float64
dtypes: float64(1), object(5)
memory usage: 49.8+ KB


- **describe():** Summary statistics for numerical columns

In [90]:
X.describe()

|       | Population Count |
|-------|------------------|
| count | 1026.0           |
| mean  | 185542.9         |
| std   | 218700.5         |
| min   | 1048.0           |
| 25%   | 30522.5          |
| 50%   | 90393.0          |
| 75%   | 287951.8         |
| max   | 1643095.0        |


- **head(), tail():** Show first or last few rows of a table

In [91]:
X.head(10)

|   | Year                   | Age      | Gender | Educational Attainment              | Personal Income      | Population Count |
|---|------------------------|----------|--------|-------------------------------------|----------------------|------------------|
| 0 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | Children under 15                   | No Income            | NaN              |
| 1 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | No high school diploma              | No Income            | 650889.0         |
| 2 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | No high school diploma              | \$5,000 to \$9,999   | 30152.0          |
| 3 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | No high school diploma              | \$10,000 to \$14,999 | 7092.0           |
| 4 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | No high school diploma              | \$15,000 to \$24,999 | 3974.0           |
| 5 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | No high school diploma              | \$25,000 to \$34,999 | 2606.0           |
| 6 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | No high school diploma              | \$35,000 to \$49,999 | 2227.0           |
| 7 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | High school or equivalent           | No Income            | NaN              |
| 8 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | Some college, less than 4-yr degree | No Income            | 8664.0           |
| 9 | 01/01/2008 12:00:00 AM | 00 to 17 | Male   | Some college, less than 4-yr degree | \$10,000 to \$14,999 | 1304.0           |


In [92]:
X.tail()

|      | Year                   | Age       | Gender | Educational Attainment      | Personal Income      | Population Count |
|------|------------------------|-----------|--------|-----------------------------|----------------------|------------------|
| 1055 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$15,000 to \$24,999 | 82988.0          |
| 1056 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$25,000 to \$34,999 | 59607.0          |
| 1057 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$35,000 to \$49,999 | 113584.0         |
| 1058 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$50,000 to \$74,999 | 97657.0          |
| 1059 | 01/01/2014 12:00:00 AM | 65 to 80+ | Female | Bachelor's degree or higher | \$75,000 and over    | 110009.0         |


# Overview: The 'Black Box' Metaphor For Machine Learning
Historically, machine learning has been described as a ‘black box’. A ‘black box’ is a box or system in which you cannot see what is happening inside. You can only know what goes in (the inputs) and what comes out (the outputs). Machine learning professionals use this metaphor to describe a statistical model whose operations are so complex that they are not visible to the end-user. As you progress through this program, keep this metaphor in mind as a framework for conceptualizing the possibilities and limitations of machine learning.

# Overview: Introduction to Statistical Analysis
At its core, statistical analysis is the underlying driver of machine learning and artificial intelligence models. In this section of Module 1, you will be introduced to the concept of a probability distribution, which describes all the possible outcomes that a random variable can take within a given range and how likely each one is. There are two types of probability distribution: discrete and continuous.

In addition, it is important to mention that due to the complexity of machine learning and artificial intelligence problems, a standard, iterative, end-to-end structured approach is essential to make the statistical analysis more functional. The CRISP-DM (Cross-Industry Standard Process for Data Mining), shown in the diagram below, is one of the most popular frameworks in the industry. It includes six key steps: business understanding, data understanding, data preparation, modeling, evaluation and deployment. Keep this in mind as you move through Modules 1 and 2 and build your understanding of statistical analysis and probability. More information will be provided about the CRISP-DM in Module 3.

<img src="Images/1.png">

# Mini-Lesson 1.1: Calculating Expected Value and Variance
## Expected value 
The term expected value means exactly what it sounds like: the result you can expect from some action. It can be calculated by summing the values of a random variable with each value multiplied by its probability of occurrence. Formulaically, expected value is expressed as:

<img src="Images/2.png">

Where:

- ∑  -  sum
- i   -  index
- n  -  total number of possible outcomes
- Xi -  value of outcome i
- Pi -  probability of observing outcome i

### An example
Consider the scenario where a fair, three-sided die is rolled to obtain a number based upon a random variable. The total number of possible outcomes is three, so n=3. The value of each of the outcomes can only be one, two, or three, and the probability of seeing any specific outcome would be one-third (a three-sided fair die). 

To calculate the expected value, you take the first possibility from the die (one) and multiply it by the probability of that result coming up (one-third). Then, you do the same for the other possible results (two multiplied by one-third, and three multiplied by one-third) until all possibilities are exhausted. Once the multiplication is complete, you add the results together to get the expected value. Therefore, your calculation looks like the following:

<img src="Images/3.png">

Therefore, the expected value would be 2 (one-third + two-thirds + 1).
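
The same calculation can be checked in a few lines of Python. This is a small sketch using exact fractions so that no rounding occurs. The variable names are chosen for this example only.

```python
from fractions import Fraction

# Outcomes of a fair three-sided die and their probabilities.
outcomes = [1, 2, 3]
probabilities = [Fraction(1, 3)] * 3

# E[X] is the sum of each value multiplied by its probability.
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(expected_value)  # 2
```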

## Expected variance
The expected value gives you the long-term average. Since a random variable changes with each turn, roll, draw, or trial, its values will also show some variation around that average. The formula for calculating expected variance is as follows:

<img src="Images/4.png">

As with the first example, you now want to calculate the variance for a three-sided die. You have already found E[X] = 2, so you can calculate $(E[X])^2$ as $2^2 = 4$. To complete the formula, you now need to solve for $E[X^2]$. This can be calculated in the following manner:

<img src="Images/5.png">

The final step is to subtract $(E[X])^2$ from $E[X^2]$, as shown below:

<img src="Images/6.png">
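
As a cross-check, the variance of the three-sided die can be computed in Python. This sketch assumes the formula in use is Var(X) = E[X^2] - (E[X])^2, which is what the steps above describe. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

outcomes = [1, 2, 3]
p = Fraction(1, 3)  # each face of the fair three-sided die is equally likely

expected_value = sum(x * p for x in outcomes)       # E[X] = 2
expected_square = sum(x**2 * p for x in outcomes)   # E[X^2] = (1 + 4 + 9) / 3 = 14/3
variance = expected_square - expected_value**2      # 14/3 - 4 = 2/3
print(variance)  # 2/3
```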

# Mini-Lesson 1.2: The Basics of Using pandas
As mentioned before, pandas is a leading Python library that data scientists and analysts use for analyzing data, and it is the foundation of most data projects. Indeed, it has become one of the preferred libraries for data wrangling (or munging) among professionals in the ML industry. How does it work? Pandas extracts data from a source such as a .csv file or a SQL-based database and creates a Python object of rows and columns called a dataframe. The dataframe looks very similar to a table in a relational database or an Excel spreadsheet. Each column of the dataframe is what is called a series.

In the following example, Widget A is a series. It includes an auto-generated index column on the left that pandas creates (0, 1, 2, 3). The same structure applies to Widget B. Combining Widget A and Widget B creates a dataframe in which both series share the same index (0, 1, 2, 3).

<img src="Images/7.png">

This image depicts Widget A with index, Widget B with index, and a combined dataframe with index.
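
The relationship between series and dataframe can be sketched in code as follows. The values are borrowed from the widgets dictionary used in Codio Activity 1.1. The image itself may show different numbers.

```python
import pandas as pd

# Two series, each with the default index 0, 1, 2, 3.
widget_a = pd.Series([10, 9, 7, 8], name='WidgetA')
widget_b = pd.Series([11, 12, 30, 1], name='WidgetB')

# Combining along columns aligns the two series on their shared index.
widgets = pd.concat([widget_a, widget_b], axis=1)
print(widgets)

# Selecting one column of the dataframe gives back a series.
print(type(widgets['WidgetA']))
```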

There are many ways to create a dataframe in Python. When working with spreadsheet-based files, creating a dataframe is simple if you have already installed the pandas package. If you have not yet installed pandas, complete the pandas installation tutorial first.

Once installed, complete the following steps to create a dataframe. The text in **bold** is the code you can use to complete each step.

1. Import the pandas library and abbreviate the name by renaming it as pd.
    **import pandas as pd**
2. Read the spreadsheet-based file (e.g. .xlsx). By default, the pandas read_excel function reads the first sheet of the file and automatically creates a dataframe from it. For .csv files, use read_csv instead.
    **dataframe1 = pd.read_excel(r'C:\YourDriveLocation\SampleFile.xlsx')** 
3. Print the dataframe to the screen to examine its content.
    **print(dataframe1)**
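
The three steps above can be sketched end to end as shown below. To keep the sketch runnable without an Excel engine (reading .xlsx files needs an optional package such as openpyxl), it substitutes a small CSV file written to a temporary folder and reads it with read_csv. The file name and its contents are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd  # step 1: import pandas under its usual alias

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file, created here only so the example is self-contained.
    sample_path = Path(tmp_dir) / 'SampleFile.csv'
    sample_path.write_text('WidgetA,WidgetB\n10,11\n9,12\n')

    # Step 2: read the file into a dataframe (read_csv stands in for read_excel).
    dataframe1 = pd.read_csv(sample_path)

# Step 3: print the dataframe to examine its content.
print(dataframe1)
```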

## Codio Activity 1.1: Python Coding
This assignment focuses on creating DataFrames using a dictionary, a URL, and a local file.
    
### Index:

- [Problem 1: Creating a DataFrame](#Problem-1:-Creating-a-DataFrame)
- [Problem 2: Create a dataframe from url](#Problem-2:-Create-a-dataframe-from-url)
- [Problem 3: Create a dataframe by importing data from a file](#Problem-3:-Create-a-dataframe-by-importing-data-from-an-file)

In [93]:
import pandas as pd

### Problem 1: Creating a DataFrame
Use a dictionary to create a DataFrame. Assign your results to the variable total_widgets below.

In [94]:
###GRADED
#Create a dataframe from the widgets dictionary

#Create the dictionary of simulated data
data = {
    'WidgetA': [10, 9, 7, 8], 
    'WidgetB': [11, 12, 30, 1]
}

#And then pass it to the pandas DataFrame constructor total_widgets

total_widgets =  None

# YOUR CODE HERE
total_widgets = pd.DataFrame(data)

#Printing the total widgets 
print(total_widgets)

   WidgetA  WidgetB
0       10       11
1        9       12
2        7       30
3        8        1


### Problem 2: Create a dataframe from url
Now, use a URL to load data into a DataFrame. The URL below links to a .csv file in a GitHub repository containing data about diamonds. Assign your resulting DataFrame to the variable diamonds below. 

In [95]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv'

In [96]:
### GRADED

diamonds = ''

# YOUR CODE HERE
diamonds = pd.read_csv(url)

# Answer check
print(type(diamonds))
diamonds.head()

<class 'pandas.core.frame.DataFrame'>


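|   | carat | cut     | color | clarity | depth | table | price | x    | y    | z    |
|---|-------|---------|-------|---------|-------|-------|-------|------|------|------|
| 0 | 0.23  | Ideal   | E     | SI2     | 61.5  | 55.0  | 326   | 3.95 | 3.98 | 2.43 |
| 1 | 0.21  | Premium | E     | SI1     | 59.8  | 61.0  | 326   | 3.89 | 3.84 | 2.31 |
| 2 | 0.23  | Good    | E     | VS1     | 56.9  | 65.0  | 327   | 4.05 | 4.07 | 2.31 |
| 3 | 0.29  | Premium | I     | VS2     | 62.4  | 58.0  | 334   | 4.2  | 4.23 | 2.63 |
| 4 | 0.31  | Good    | J     | SI2     | 63.3  | 58.0  | 335   | 4.34 | 4.35 | 2.75 |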


<a id="Problem-3:-Create-a-dataframe-by-importing-data-from-an-file"></a>

### Problem 3: Create a dataframe by importing data from a file
Now, use the datafile located in the data folder named nba.csv. Assign your results as a DataFrame to the variable nba below. 

In [97]:
### GRADED

nba = ''

# YOUR CODE HERE
path = 'DataSets/nba.csv'
nba = pd.read_csv(path)

# Answer check
print(type(nba))
nba.head()

<class 'pandas.core.frame.DataFrame'>


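|   | Name          | Team           | Number | Position | Age  | Height | Weight | College           | Salary    |
|---|---------------|----------------|--------|----------|------|--------|--------|-------------------|-----------|
| 0 | Avery Bradley | Boston Celtics | 0.0    | PG       | 25.0 | 6-2    | 180.0  | Texas             | 7730337.0 |
| 1 | Jae Crowder   | Boston Celtics | 99.0   | SF       | 25.0 | 6-6    | 235.0  | Marquette         | 6796117.0 |
| 2 | John Holland  | Boston Celtics | 30.0   | SG       | 27.0 | 6-5    | 205.0  | Boston University | NaN       |
| 3 | R.J. Hunter   | Boston Celtics | 28.0   | SG       | 22.0 | 6-5    | 185.0  | Georgia State     | 1148640.0 |
| 4 | Jonas Jerebko | Boston Celtics | 8.0    | PF       | 29.0 | 6-10   | 231.0  | NaN               | 5000000.0 |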


## Codio Activity 1.2: Exploring a DataFrame 
In this exercise you will use basic summarization methods on a pandas DataFrame.

### Index:

- [Problem 1: Describing the data](#Problem-1:-Describing-the-data)
- [Problem 2: Getting the missing data information](#Problem-2:-Getting-the-missing-data-information)
- [Problem 3: Examining top 10 rows](#Problem-3:-Examining-top-10-rows) 

In [98]:
import pandas as pd
import numpy as np

In [99]:
nba = pd.read_csv('DataSets/nba.csv', index_col = 0)

### Problem 1: Describing the data
Using the DataFrame `nba` loaded above, get a summary of the numeric variables using the `.describe` method.  Assign your results as a DataFrame to `nba_summary` below. 

In [100]:
### GRADED

nba_summary = None

# YOUR CODE HERE
nba_summary = nba.describe()

# Answer check
print(nba_summary)
print(type(nba_summary))

           Number         Age      Weight        Salary
count  457.000000  457.000000  457.000000  4.460000e+02
mean    17.678337   26.938731  221.522976  4.842684e+06
std     15.966090    4.404016   26.368343  5.229238e+06
min      0.000000   19.000000  161.000000  3.088800e+04
25%      5.000000   24.000000  200.000000  1.044792e+06
50%     13.000000   26.000000  220.000000  2.839073e+06
75%     25.000000   30.000000  240.000000  6.500000e+06
max     99.000000   40.000000  307.000000  2.500000e+07
<class 'pandas.core.frame.DataFrame'>


### Problem 2: Getting the missing data information 
In the National Basketball Association (NBA), some athletes do not take the usual career path of playing for a college and then becoming a professional athlete. Some athletes skip college and go professional directly out of high school. How many athletes in this dataframe skipped college and went straight to the professional level?

To find this answer, create a series with the counts of null values for each column in the `nba` DataFrame.  Assign your solution as a pandas Series to `nba_nulls` below.  **HINT**: Use the `.isnull` and `.sum` methods together. 

In [101]:
### GRADED

nba_nulls = None

# YOUR CODE HERE
nba_nulls=nba.isnull().sum()

# Answer check
print(nba_nulls)
print(type(nba_nulls))

Team         1
Number       1
Position     1
Age          1
Height       1
Weight       1
College     85
Salary      12
dtype: int64
<class 'pandas.core.series.Series'>


### Problem 3: Examining top 10 rows
Use the `.head` method to display the first 10 rows of the nba data.  Assign your results as a DataFrame to `nba_head` below.

In [102]:
### GRADED

nba_head = None

# YOUR CODE HERE
nba_head=nba.head(10)

# Answer check
print(type(nba_head))
print(nba_head.shape)

<class 'pandas.core.frame.DataFrame'>
(10, 8)


## Codio Activity 1.3: Selecting Data in Multiple Ways

### Index:

- [Part one: loc[]](#Part-one:-loc[])
- [Part two: iloc[]](#Part-two:-iloc[])
- [Part three:Split](#Part-three:Split)

In [103]:
import pandas as pd

### Part one: loc[]
In this exercise, return the basic information from a dataframe using loc[]

In [104]:
#create the DataFrame utilizing Pandas 
data = pd.read_csv('DataSets/nba.csv')
data = data.iloc[:-1 , :]
data.head(5)

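|   | Name          | Team           | Number | Position | Age  | Height | Weight | College           | Salary    |
|---|---------------|----------------|--------|----------|------|--------|--------|-------------------|-----------|
| 0 | Avery Bradley | Boston Celtics | 0.0    | PG       | 25.0 | 6-2    | 180.0  | Texas             | 7730337.0 |
| 1 | Jae Crowder   | Boston Celtics | 99.0   | SF       | 25.0 | 6-6    | 235.0  | Marquette         | 6796117.0 |
| 2 | John Holland  | Boston Celtics | 30.0   | SG       | 27.0 | 6-5    | 205.0  | Boston University | NaN       |
| 3 | R.J. Hunter   | Boston Celtics | 28.0   | SG       | 22.0 | 6-5    | 185.0  | Georgia State     | 1148640.0 |
| 4 | Jonas Jerebko | Boston Celtics | 8.0    | PF       | 29.0 | 6-10   | 231.0  | NaN               | 5000000.0 |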


In [105]:
### GRADED

# Please print the information for the first line in the DataFrame. 

locinfo = None

# YOUR CODE HERE
locinfo = data.loc[0]

# ANSWER CHECK
print(locinfo)

Name         Avery Bradley
Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: 0, dtype: object


In [106]:
### GRADED
# Please show LeBron James' team

teamdata = None

# YOUR CODE HERE
teamdata = data.loc[data['Name']=="LeBron James",["Team"]]

# ANSWER CHECK
print(teamdata)

                    Team
169  Cleveland Cavaliers


### Part two: iloc[]
In this exercise, return single selections using iloc[]

In [107]:
### GRADED
# Please print the first row of data frame 

fr = None

# YOUR CODE HERE
fr = data.iloc[0]

# ANSWER CHECK
print(fr)

Name         Avery Bradley
Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: 0, dtype: object


In [108]:
### GRADED
# Please print the second row of data frame 

sr = None

# YOUR CODE HERE
sr = data.iloc[1]

# ANSWER CHECK
print(sr)

Name           Jae Crowder
Team        Boston Celtics
Number                99.0
Position                SF
Age                   25.0
Height                 6-6
Weight               235.0
College          Marquette
Salary           6796117.0
Name: 1, dtype: object


In [109]:
### GRADED
# Please print the last row of data frame

lr = None

# YOUR CODE HERE
lr = data.iloc[-1]

# ANSWER CHECK
print(lr)

Name        Jeff Withey
Team          Utah Jazz
Number             24.0
Position              C
Age                26.0
Height              7-0
Weight            231.0
College          Kansas
Salary         947276.0
Name: 456, dtype: object


In [110]:
###GRADED
# Please print the first column of data frame
 
fc = None

# YOUR CODE HERE
fc = data.iloc[:,0]

# ANSWER CHECK
print(fc)

0      Avery Bradley
1        Jae Crowder
2       John Holland
3        R.J. Hunter
4      Jonas Jerebko
           ...      
452       Trey Lyles
453     Shelvin Mack
454        Raul Neto
455     Tibor Pleiss
456      Jeff Withey
Name: Name, Length: 457, dtype: object


In [111]:
### GRADED
# Please print the second column of data frame

sc = None

# YOUR CODE HERE
sc = data.iloc[:,1]

# ANSWER CHECK
print(sc)

0      Boston Celtics
1      Boston Celtics
2      Boston Celtics
3      Boston Celtics
4      Boston Celtics
            ...      
452         Utah Jazz
453         Utah Jazz
454         Utah Jazz
455         Utah Jazz
456         Utah Jazz
Name: Team, Length: 457, dtype: object


In [112]:
### GRADED
# Please print the last column of data frame

lc = None

# YOUR CODE HERE
lc = data.iloc[:,-1]

# ANSWER CHECK
print(lc)

0      7730337.0
1      6796117.0
2            NaN
3      1148640.0
4      5000000.0
         ...    
452    2239800.0
453    2433333.0
454     900000.0
455    2900000.0
456     947276.0
Name: Salary, Length: 457, dtype: float64
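
With the default integer index, loc[0] and iloc[0] return the same row, which can hide the difference between the two. The sketch below uses a hypothetical two-row roster indexed by player name (names, teams, and ages taken from the outputs above) to show that loc selects by label while iloc selects by position.

```python
import pandas as pd

# Hypothetical mini-roster indexed by player name instead of 0, 1, 2, ...
roster = pd.DataFrame(
    {'Team': ['Boston Celtics', 'Utah Jazz'], 'Age': [25, 26]},
    index=['Avery Bradley', 'Jeff Withey'],
)

by_label = roster.loc['Jeff Withey', 'Team']  # row and column chosen by label
by_position = roster.iloc[1, 0]               # second row, first column by position
print(by_label, by_position)
```

Both lookups return the same cell here. With a name-based index, `roster.loc[1]` would raise a KeyError because no row carries the label 1.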


### Part three:Split
In this exercise, return the first and last name utilizing the split function

In [113]:
### GRADED

# Please print the first name from the first row in the data frame

firstname = None

# YOUR CODE HERE
firstname = data.loc[0,'Name'].split()[0]

# ANSWER CHECK
print('The first name is ' + firstname)

The first name is Avery


In [114]:
###GRADED

# Please print the last name from the first row in the data frame

lastname = None

# YOUR CODE HERE
lastname = data.loc[0, 'Name'].split()[-1]

# ANSWER CHECK
print('The last name is ' + lastname)

The last name is Bradley
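
The cells above split a single string. As a possible extension, the pandas `.str` accessor applies the same split to a whole column at once. The sketch below uses three names from the nba data as a stand-in for the full `Name` column.

```python
import pandas as pd

names = pd.Series(['Avery Bradley', 'Jae Crowder', 'R.J. Hunter'])

# .str applies the string method to every element of the series.
parts = names.str.split()
first_names = parts.str[0]   # first piece of each split name
last_names = parts.str[-1]   # last piece of each split name

print(first_names.tolist())  # ['Avery', 'Jae', 'R.J.']
print(last_names.tolist())   # ['Bradley', 'Crowder', 'Hunter']
```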


## Discussion 1.2: Data Repositories
Conduct an Internet search for data repositories in CSV format that are available to the general public. Examples of places you could search for datasets are Kaggle and Stack Overflow. Select three to four datasets that are of interest to you. Then, for each dataset you choose, list the following information:

- A short description of the content of the data
- One or two possible business uses of the data
- The location (URL) of the dataset 
- A summary of the characteristics of the dataset using the describe() function in pandas

Please note: You will use one of these datasets again later in this module, so please select an appropriately sized dataset for this program and your skill level. A dataset containing 10 columns and 500 rows is a good size to start. Hint: If your dataset contains data from many years, you may want to select just one year or a few years to work with later.

### Hourly Energy Consumption

The selected file was "AEP_hourly.csv," which represents the estimated energy consumption of American Electric Power (AEP) in Megawatts (MW).

- Business uses:
    - It can help calculate the electricity tariff based on its usage/traffic.
    - It can be used to observe the typical energy consumption and identify anomalies that could be considered as "losses".
- Location: https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption
- Summary:

|           | AEP_MW      |
|-----------|-------------|
| count     | 121273      |
| mean      | 15499.5137  |
| std       | 2591.3991   |
| min       | 9581.0      |
| 25%       | 13630.0     |
| 50%       | 15310.0     |
| 75%       | 17200.0     |
| max       | 17200.0     |

### Domestic Violence in Colombia
A compilation of data on domestic violence as it has developed in Colombia over time.

- Business uses:
    - It can be used to determine which gender reports more cases of domestic violence and implement reinforced support measures for this population.
    - It can be used to identify which department has the highest number of such cases and where to focus efforts.
- Location: https://www.kaggle.com/datasets/oscardavidperilla/domestic-violence-in-colombia
- Summary:

|          | CANTIDAD        |
|----------|-----------------|
| count    | 476970.000000  |
| mean     | 1.707764        |
| std      | 3.338647        |
| min      | 1.000000        |
| 25%      | 1.000000        |
| 50%      | 1.000000        |
| 75%      | 1.000000        |
| max      | 130.000000      |

### Graduation Rate
The dataset includes 1000 rows, with one row for each high school in the dataset. The graduation rates for each school were generated randomly, and are not based on any actual data.

- Business uses:
    - It can be used to partially evaluate how much the SAT determines the number of years it may take a student to graduate from the program.
    - It can help identify if the fact that parents have certain levels of education influences the student's education.
- Location: https://www.kaggle.com/datasets/rkiattisak/graduation-rate
- Summary:

|       | ACT composite score | SAT total score | Parental income | High school GPA | College GPA | Years to graduate |
|-------|---------------------|-----------------|-----------------|-----------------|-------------|-------------------|
| count | 1000.000000         | 1000.000000     | 1000.00000      | 1000.000000     | 1000.000000 | 1000.000000       |
| mean  | 28.607000           | 1999.906000     | 67377.85200     | 3.707400        | 3.376500    | 4.982000          |
| std   | 2.774211            | 145.078361      | 18827.33105     | 0.287381        | 0.237179    | 1.414099          |
| min   | 20.000000           | 1598.000000     | 18906.00000     | 2.800000        | 2.600000    | 3.000000          |
| 25%   | 27.000000           | 1898.000000     | 54269.75000     | 3.500000        | 3.200000    | 4.000000          |
| 50%   | 28.500000           | 2000.000000     | 67842.50000     | 3.800000        | 3.400000    | 5.000000          |
| 75%   | 31.000000           | 2099.000000     | 80465.50000     | 4.000000        | 3.500000    | 6.000000          |
| max   | 36.000000           | 2385.000000     | 124470.00000    | 4.000000        | 4.000000    | 10.000000         |



# Mini-Lesson 1.3: Analyzing Data Using pandas
As a supplement to Video 1.6, this mini-lesson demonstrates how to calculate some common mathematical functions in pandas, such as divide, sum, and percent. It is common to perform these basic mathematical operations while dealing with data science or machine learning tasks. Pandas’ use of mathematical operations will help with data handling and manipulation. 

## Division
To divide one column by another in pandas, place the / operator between the two columns. The division is applied row by row, and the result can be stored in a new column.

    df['divide'] = df['column1'] / df['column2']

## Summation
To add two columns in pandas, place the + operator between them. The addition is applied row by row.

    df['sum'] = df['column1'] + df['column2']

## Percentage
While pandas does not provide a dedicated function for calculating each value's percentage of a column total, you can compute it by dividing the column by its sum and then multiplying by 100:

    df['percent'] = (df['col'] / df['col'].sum()) * 100

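The three operations above can be tried together on a small DataFrame. This is a minimal sketch: the data and the column names (`column1`, `column2`) are placeholders made up for illustration, not part of the course dataset.

```python
import pandas as pd

# Hypothetical data: values and column names are placeholders for illustration.
df = pd.DataFrame({"column1": [10, 20, 30], "column2": [2, 4, 5]})

# Element-wise division and addition of two columns.
df["divide"] = df["column1"] / df["column2"]
df["sum"] = df["column1"] + df["column2"]

# Each row's share of the column total, expressed as a percentage.
df["percent"] = df["column1"] / df["column1"].sum() * 100

print(df)
```

Because each operation works on whole columns, no explicit loop over the rows is needed.
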
# Mini-Lesson 1.4: What Are Histograms and Why Do They Matter?
In addition to discussing column operations in Video 1.6, Dr. Gomes also covered plotting in pandas. As a supplement, this mini-lesson describes a few of the advantages of visualizing your data with histograms.

## Histograms
Histograms are commonly used to visualize the distribution of a single numerical variable. A histogram divides the range of a numeric variable into multiple bins and counts the observations that fall into each bin. This columnar representation of binned counts gives you an instant sense of the distribution of values in a variable. Pandas provides a function for this called hist() that generates one histogram per column of numerical data. In data science, histograms are important during exploratory data analysis because they reveal properties of the data that summary statistics cannot. In particular, histograms reveal the shape of the data, any skew, and outliers that might be present.
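To make the binning step concrete, here is a minimal sketch that counts observations per bin with NumPy and prints a text version of the histogram. The scores are made up for illustration, and the choice of 5 bins is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, made up for illustration.
scores = pd.Series([52, 55, 61, 64, 67, 70, 71, 73, 78, 95], name="score")

# Split the value range into 5 equal-width bins and count observations per bin.
counts, edges = np.histogram(scores, bins=5)

# Print one row per bin: its range and a bar as long as its count.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * int(n)}")

# With matplotlib installed, pandas draws the same binned counts as a chart:
# scores.hist(bins=5)
```

The number of bins changes how the distribution looks, so it is worth trying a few values before drawing conclusions.
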

## Data shapes
Histograms can help you determine how symmetric your dataset is. Normally distributed data, with its familiar bell-shaped curve (as in Figure 1), is convenient to analyze because many statistical methods assume normality.

<img src="Images/8.png">

## Skewed data
Identifying skewed data with a histogram is an important step in determining the suitability of a dataset. Typically, a histogram of skewed data differs from a normal distribution in that one tail extends further than the other. This skew can be to the left (as in Figure 2) or to the right. Skewed data may require a transformation so that it more closely follows a normal distribution.

<img src="Images/9.png">
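Skew can also be measured numerically. The sketch below uses made-up, right-skewed values and a log transform; the log transform is only one common option and is an assumption here, not a step prescribed by the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data (a few very large values), made up for illustration.
income = pd.Series([20, 22, 25, 27, 30, 32, 35, 40, 90, 250], name="income")

# Series.skew() is positive for a long right tail, negative for a long left tail,
# and close to zero for symmetric data.
print("skew before:", round(income.skew(), 2))

# A log transform compresses large values, which often reduces right skew.
log_income = np.log(income)
print("skew after: ", round(log_income.skew(), 2))
```

A log transform only applies to strictly positive values, so other transformations may be needed for data that includes zeros or negatives.
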

## Outliers
Histograms are an important tool for spotting outliers (as in Figure 3). Outliers can be explained or unexplained observations, and they may require some investigation to determine whether they are legitimate. Once the investigation is complete, you can decide how to handle them, for example by transforming the data to fit a more normal distribution.

<img src="Images/10.png">
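Outliers can be flagged programmatically as a starting point for that investigation. The sketch below uses the 1.5 × IQR rule, a common convention that this lesson does not itself prescribe, on values made up for illustration.

```python
import pandas as pd

# Hypothetical measurements with one unusually large value, made up for illustration.
values = pd.Series([12, 13, 13, 14, 15, 15, 16, 17, 18, 60], name="value")

# Interquartile range: the spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed convention: flag anything more than 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```

Flagged values are candidates for review, not automatic deletions; whether they are errors or genuine observations still has to be judged from context.
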

## Discussion 1.3: Creating Visualizations Using Personally-Sourced Data
Create a visualization using the pandas plot() function (scatter, bar chart, line) and post your most interesting results. For the visualization that you selected, please describe why you chose that plot type and any transformations to the data that you had to make in order to generate your visualization. Additionally, please describe any interesting trends (e.g., increased monthly sales) that you observe in your results.

### Domestic Violence in Colombia

I used a dataset on domestic violence in Colombia because I was curious about three questions:

1. Which gender had the majority of cases?

    **Data Wrangling:** For this question, I simply had to count the cases by the victim's gender. As you can observe, there are some cases where the gender of the victim was not reported. Since the 'Gender' value is descriptive rather than quantitative, I decided to use a bar chart to conduct the analysis.
    
    **Findings:** As can be seen, the large majority of reported cases involved female victims, almost four times the number of cases with male victims. This highlights a clear trend of a population that is being abused (it should be noted that it does not indicate anything about the gender, age, or other characteristics of the perpetrator).

<img src="Images/12.png">

2. In which years were there high and low numbers of these cases?

    **Data Wrangling:** In this case, I compared the number of cases per year. Since they are temporal measurements, I decided to use a line graph to see if any trend could be visualized. The data did not need to be treated in any special way.
    
    **Findings:** This graph shows an increase in the number of cases from 20-40 cases per year to over 120. However, for a period after 2019, there was a significant drop. It is highly probable that the data for that period is not accurate because, as reported by various media outlets, there was an alarming increase in these cases throughout the country during the pandemic.

<img src="Images/13.png">

3. In which department were these cases more prevalent?

    **Data Wrangling:** In this last graph, I resorted once again to a bar chart as it serves the same purpose as the first one. That is, the value to compare is descriptive rather than quantitative, and the number of departments is not large enough to saturate the x-axis of the graph. For the dataset used in this graph, I had to perform the summation of cases reported per department.
    
    **Findings:** We can see that places like "Valle" or "Antioquia" have a much higher number of cases compared to other departments. However, the most notable aspect of this graph is the overwhelmingly large number of cases in "Cundinamarca" (which is home to the country's capital) compared to every other department. It is interesting to note that the three places with the highest number of cases are neighboring locations.

<img src="Images/14.png">
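The data wrangling described in the three questions above comes down to counting rows per category. The sketch below shows one way this could be done; the rows are placeholders rather than real records, and the column names (`gender`, `year`, `department`) are assumptions about how such a dataset might be laid out.

```python
import pandas as pd

# Placeholder rows, not real records; the column names are assumptions.
cases = pd.DataFrame({
    "gender": ["F", "F", "M", "F", None, "M"],
    "year": [2018, 2018, 2019, 2019, 2019, 2020],
    "department": ["A", "B", "A", "A", "C", "B"],
})

# Cases per gender; dropna=False keeps rows where gender was not reported.
by_gender = cases["gender"].value_counts(dropna=False)

# Cases per year, in chronological order for a line chart.
by_year = cases.groupby("year").size()

# Cases per department, largest first for a bar chart.
by_department = cases["department"].value_counts()

print(by_gender, by_year, by_department, sep="\n\n")

# With matplotlib installed, each result can be plotted directly:
# by_gender.plot(kind="bar")
# by_year.plot(kind="line")
# by_department.plot(kind="bar")
```
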

#### Conclusion
It is alarming to see that the number of cases is increasing every year. This indicates that further efforts are needed to raise awareness and educate the population (especially in areas with a high number of cases like Cundinamarca) about sensitive issues such as domestic violence that undermine the family unit. Support must be provided to the victims of these acts, who are mostly women (while not disregarding or undervaluing male victims).
Whether it is due to a deeply rooted cultural issue in Colombia or poor social and emotional management, there is no excuse for not changing these patterns and striving for a more just society where violence, regardless of the victim's gender, is not tolerated at all.

# Glossary
- **‘Black Box’:**
This is a metaphor for understanding machine learning.
- **Continuous Distribution:**
This is a distribution in which data can take on any value within a specified range (which may be infinite). Temperature, for instance, can fall anywhere on a continuum from absolute zero to the surface of the sun, and there are infinitely many possible values between 12°C and 13°C, limited only by the precision of the thermometer.
- **DataFrame:**
This is a pandas object that stores data in rows and columns.
- **Discrete Distribution:**
This is a distribution in which data can only take on certain values, for example, integers. Recording the outcomes of rolling a six-sided die would generate a discrete distribution because there are only six possible outcomes.
- **Expected Value:**
This is the result you can expect from some action. It can be calculated by summing the values of a random variable with each value multiplied by its probability of occurrence.
- **Expected Variance:**
This is a measure of how far a set of numbers is spread out from their average value.
- **Histogram:**
A histogram divides the range of a numeric variable into multiple bins and counts the observations that fall into each bin. This columnar representation of binned counts gives you an instant sense of the distribution of values in a variable.
- **Mean:**
The mean is the average of a set of given numbers. For instance, in the given number set [0,1,2,5,6,15,20], the mean would be the sum of all the numbers divided by the count of numbers in the set. The sum of all numbers is 49. There are 7 numbers in the set, so the mean would be 49/7, or 7.
- **Median:**
The median is the middle number in a sorted set of given numbers. For instance, in the given number set [0,1,2,5,6,15,20], the median would be the middle number, or 5.
- **Outliers:**
These are explained or unexplained values that lie far outside the range of most other values in the data.
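
Several of the glossary terms can be checked with a few lines of code. This is a minimal sketch that reuses the number set from the Mean and Median entries and the six-sided die from the Discrete Distribution entry; it assumes a fair die, so each face has probability 1/6.

```python
import numpy as np

# The number set used in the Mean and Median entries.
numbers = [0, 1, 2, 5, 6, 15, 20]
mean = np.mean(numbers)      # 49 / 7 = 7.0
median = np.median(numbers)  # middle value of the sorted set = 5.0

# Assumed fair six-sided die: each face has probability 1/6.
faces = np.arange(1, 7)
probs = np.full(6, 1 / 6)

# Expected value: each outcome weighted by its probability.
expected_value = np.sum(faces * probs)

# Variance: probability-weighted squared distance from the expected value.
variance = np.sum(probs * (faces - expected_value) ** 2)

print(mean, median, expected_value, variance)
```
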