# Python Data Science Libraries

This chapter introduces three of the more popular data science libraries: NumPy, pandas, and scikit-learn.

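## NumPy
- NumPy is short for Numerical Python.
- It is used for working with arrays, including multidimensional ones.
- NumPy arrays resemble Python lists, but all elements share one data type and are stored in a contiguous block of memory, so they typically require less memory. Operations on them are also faster because they run in optimized, precompiled C code.
- NumPy arrays support element-wise operations.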
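
To make the last point concrete, the following sketch contrasts `+` on plain Python lists, which concatenates them, with `+` on NumPy arrays, which adds the corresponding elements. The variable names and values are illustrative only.

```python
import numpy as np

# Illustrative values only
first = [1, 2, 3]
second = [10, 20, 30]

# With plain lists, + concatenates the two sequences
list_result = first + second
print(list_result)

# With NumPy arrays, + adds the corresponding elements
array_result = np.array(first) + np.array(second)
print(array_result)
```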

### Creating a NumPy Array

In [1]:
import numpy as np
jeff_salary = [2700, 3000, 3000]
nick_salary = [2600, 2800, 2800]
tom_salary = [2300, 2500, 2500]
base_salary = np.array([jeff_salary, nick_salary, tom_salary])
print(base_salary)

[[2700 3000 3000]
 [2600 2800 2800]
 [2300 2500 2500]]


Example: the employees' monthly bonuses

In [2]:
jeff_bonus = [500, 400, 400]
nick_bonus = [600, 300, 400]
tom_bonus = [200, 500, 400]
bonus = np.array([jeff_bonus, nick_bonus, tom_bonus])

Performing element-wise operations

In [3]:
salary_bonus = base_salary + bonus
print(type(salary_bonus))
print(salary_bonus)

<class 'numpy.ndarray'>
[[3200 3400 3400]
 [3200 3100 3200]
 [2500 3000 2900]]
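
Element-wise operations also work between an array and a single number, which NumPy applies to every element. The sketch below rebuilds `salary_bonus` from the output above so that it runs on its own; the flat amount of 100 and the 3000 threshold are hypothetical values chosen for illustration.

```python
import numpy as np

# Totals copied from the salary_bonus output above
salary_bonus = np.array([[3200, 3400, 3400],
                         [3200, 3100, 3200],
                         [2500, 3000, 2900]])

# Hypothetical flat amount added to every monthly payment
flat_extra = 100
adjusted = salary_bonus + flat_extra
print(adjusted)

# An element-wise comparison yields a boolean array of the same shape
above_3000 = salary_bonus > 3000
print(above_3000)
```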


### Using NumPy for Statistical Functions
- NumPy provides functions for statistical analysis of array data, such as finding the maximum or the mean.

In [7]:
# finding the maximum value in salary_bonus
print(salary_bonus.max())

# finding the maximum value of an array along a given axis
# axis = 1 works across each row (maximum monthly amount paid to each employee in the past three months)
maximum_for_employee = np.amax(salary_bonus, axis = 1)
print(maximum_for_employee)

# axis = 0 works down each column (maximum amount received by any employee in each month)
maximum_for_month = np.amax(salary_bonus, axis = 0)
print(maximum_for_month)

3400
[3400 3200 3000]
[3200 3400 3400]
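
The same `axis` argument applies to other aggregations. As a minimal sketch, assuming the `salary_bonus` values printed earlier, the block below computes per-employee totals, per-month minimums, and per-employee means; the result names are made up for this example.

```python
import numpy as np

# Totals copied from the salary_bonus output shown earlier
salary_bonus = np.array([[3200, 3400, 3400],
                         [3200, 3100, 3200],
                         [2500, 3000, 2900]])

# axis = 1: one result per row (per employee)
total_per_employee = salary_bonus.sum(axis=1)
print(total_per_employee)

# axis = 0: one result per column (per month)
minimum_per_month = salary_bonus.min(axis=0)
print(minimum_per_month)

# The mean returns floating-point values even for integer input
mean_per_employee = salary_bonus.mean(axis=1)
print(mean_per_employee)
```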


Exercise #2

In [9]:
# Average of the maximum monthly amount paid to each employee in the past three months
print(np.mean(maximum_for_employee))

# Average for the maximum amount received by an employee each month
print(np.mean(maximum_for_month))

3200.0
3333.3333333333335


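## pandas
- The name "pandas" is derived from "panel data", an econometrics term; it is also a play on "Python data analysis".
- It provides two main data structures: Series (1D) and DataFrame (2D).
- DataFrame is the primary one, but each of its columns is a Series, so a DataFrame can be seen as a collection of Series objects.
- Therefore, understanding Series is just as important as understanding DataFrame.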

### pandas Series
- A pandas Series is a 1D labeled array
- By default, the elements in a Series are labeled with integers according to their position, like in a Python list.
- The labels don't have to be unique, but they must be of a hashable type (such as integers, floats, strings, or tuples).
- Ultimately, a Series is a column in a DataFrame
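
To illustrate that labels need not be unique, here is a small sketch with hypothetical names and department labels. Selecting a label that occurs more than once returns every matching element as a Series.

```python
import pandas as pd

# Hypothetical data: the label 'sales' appears twice on purpose
staff = pd.Series(['Ann', 'Bob', 'Cid'], index=['sales', 'it', 'sales'])

# The index reports whether its labels are unique
print(staff.index.is_unique)

# A repeated label selects all matching elements
sales_people = staff['sales']
print(sales_people)
```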

### Creating a Series

In [10]:
import pandas as pd
data = ['Jeff Russell', 'Jane Boorman', 'Tom Heints']
emps_names = pd.Series(data)
print(emps_names)

0    Jeff Russell
1    Jane Boorman
2      Tom Heints
dtype: object


In [11]:
# creating a Series with user-defined indices
data = ['Jeff Russell', 'Jane Boorman', 'Tom Heights']
emps_names = pd.Series(data, index = [9001, 9002, 9003])
print(emps_names)

9001    Jeff Russell
9002    Jane Boorman
9003     Tom Heights
dtype: object
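
As an alternative sketch, a Series with custom labels can also be built from a dictionary, in which case the keys become the index. The names `names_by_id` and `emps_from_dict` are introduced here for illustration only.

```python
import pandas as pd

# Dictionary keys become the index labels, values become the data
names_by_id = {9001: 'Jeff Russell', 9002: 'Jane Boorman', 9003: 'Tom Heights'}
emps_from_dict = pd.Series(names_by_id)
print(emps_from_dict)
```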


### Accessing Data in a Series

In [15]:
print(emps_names[9001])
print("---")

# Alternatively
print(emps_names.loc[9001])
print("---")

# You can still access elements by integer position with .iloc
print(emps_names.iloc[0])
print("---")

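# Accessing multiple elements with a slice operation
# .loc slices by label and includes both endpoints;
# .iloc slices by position and excludes the stop (plain [] behaved positionally in this run)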
print(emps_names.loc[9001:9002])
print("---")
print(emps_names.iloc[0:2])
print("---")
print(emps_names[0:2])


Jeff Russell
---
Jeff Russell
---
Jeff Russell
---
9001    Jeff Russell
9002    Jane Boorman
dtype: object
---
9001    Jeff Russell
9002    Jane Boorman
dtype: object
---
9001    Jeff Russell
9002    Jane Boorman
dtype: object
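
The difference between label-based and position-based slicing is easy to miss, because `.loc[9001:9002]` and `.iloc[0:2]` happen to return the same two elements above. The sketch below, which recreates the Series so that it runs on its own, uses the slice `0:1` to show that `.iloc` excludes the stop position, while `.loc` includes the stop label.

```python
import pandas as pd

emps_names = pd.Series(['Jeff Russell', 'Jane Boorman', 'Tom Heights'],
                       index=[9001, 9002, 9003])

# Label-based slice: both 9001 and 9002 are included
by_label = emps_names.loc[9001:9002]
print(len(by_label))

# Position-based slice: position 1 (the stop) is excluded
by_position = emps_names.iloc[0:1]
print(len(by_position))
```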


### Combining Series into a DataFrame

In [16]:
data = ['jeff.russell', 'jane.boorman', 'tom.heints']
emps_emails = pd.Series(data, index = [9001, 9002, 9003], name = 'emails')
emps_names.name = 'names'
df = pd.concat([emps_names, emps_emails], axis = 1)
print(df)

             names        emails
9001  Jeff Russell  jeff.russell
9002  Jane Boorman  jane.boorman
9003   Tom Heights    tom.heints
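
Going the other way, selecting a single column or a single row from a DataFrame gives back a Series. The sketch below rebuilds the same DataFrame so that it runs on its own; `names_column` and `first_row` are names introduced for this example.

```python
import pandas as pd

ids = [9001, 9002, 9003]
emps_names = pd.Series(['Jeff Russell', 'Jane Boorman', 'Tom Heights'],
                       index=ids, name='names')
emps_emails = pd.Series(['jeff.russell', 'jane.boorman', 'tom.heints'],
                        index=ids, name='emails')
df = pd.concat([emps_names, emps_emails], axis=1)

# Selecting one column returns a Series named after that column
names_column = df['names']
print(type(names_column))

# Selecting one row by label returns a Series labeled by the column names
first_row = df.loc[9001]
print(first_row)
```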
