# Python Modules, Packages and Visualization
**Learning Objectives**
* Introduce Python modules, packages, and visualization
* Practice working with the concepts learned



## Python Modules
- File containing Python definitions and statements
- Modules can be considered as a code library
- Break down larger programs into small manageable units
- Make the code easier to understand and use

Ex: Python built-in modules
- `math` for mathematical functions
- `os` for creating and removing directories, listing their contents, etc.
- `statistics` for mathematical statistics of numeric data
- `random` for generating pseudo-random numbers

Ex: Third-party modules (installed separately)
- `NumPy`: numeric Python for scientific computing
- `pandas`: data analysis tasks
- `matplotlib`: data visualization in Python

In [None]:
## display all available modules
help("modules")

### Using Python modules
#### Syntax: `import <module name>`

In [None]:
##import math module
import math
print(math.pi)  ##Print pi value
print(math.sqrt(25))    ##print square root of number

### Renaming Python modules
- You can create an alias while importing a module
- Use that alias for accessing the functionalities of module
#### Syntax: `import <module name> as <alias>`

In [None]:
####renaming math module
import math as m
print(m.pi)
print(m.sqrt(36))

### Import from Module
- Parts of a module can be imported without importing the module as a whole

#### Syntax: `from <module name> import <parts of module>`


In [None]:
####importing pi and Euler's number e from math module
from math import pi, e
print(pi)
print(e)

### Import everything from Module
- Use * symbol to import all the functions/names from the module
#### Syntax: `from <module name> import *`

In [None]:
####Importing all names from math module
from math import *
print(pi)
print(e)

### Problem 1
#### Write commands to perform below operations:
Import random module
- i. Print random integer between 0 and 10
- ii. Use seed 5 and print random numbers
- iii. Select random elements from list L = [2,3,4,67,89]
- iv. Shuffle lists L



In [None]:
# importing built in module random
import random

In [None]:
# printing random integer between 0 and 10
print(random.randint(0, 10)) 

In [None]:
##print random numbers
random.seed(5)
print(random.random())

In [None]:
##selecting a random element
L = [2,3,4,67,89]
print(random.choice(L))

In [None]:
###shuffling lists
random.shuffle(L)
print(L)

### Problem 2
Import datetime module
- i. Import date from datetime
- ii. Print today’s date
- iii. Format today’s date in below formats:
    - 1. dd/mm/yyyy
    - 2. Textual month, day and year
    - 3. mm/dd/yy
    - 4. Month abbreviation-day-year

In [None]:
# importing built in module datetime
from datetime import date
today = date.today()
print("Today's date:", today)

In [None]:
###Current date in different formats
from datetime import date

today = date.today()

# dd/mm/yyyy (%Y gives the four-digit year)
d1 = today.strftime("%d/%m/%Y")
print("d1 =", d1)

# Textual month, day and year
d2 = today.strftime("%B %d, %Y")
print("d2 =", d2)

# mm/dd/yy (%y gives the two-digit year)
d3 = today.strftime("%m/%d/%y")
print("d3 =", d3)

# Month abbreviation, day and year
d4 = today.strftime("%b-%d-%Y")
print("d4 =", d4)

### Problem 3
Write a custom module named `calc.py`
- i. Import the module
- ii. Use the add function
- iii. Rename the module as cal
- iv. Use the renamed module to perform addition
- v. Import subtract from the calc module
- vi. Use it to perform subtraction
- vii. Import everything from the module and use the addition and subtraction operations
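
The solution cells for this problem assume a file `calc.py` saved in the same folder as the notebook. Its contents are not shown here; a minimal sketch, assuming it only needs the `add` and `subtract` functions used in the exercise, might look like this:

```python
# calc.py: hypothetical contents; only the names add and subtract come from the exercise

def add(a, b):
    """Return the sum of a and b."""
    return a + b


def subtract(a, b):
    """Return a minus b."""
    return a - b
```

Once the file exists, `import calc` makes these functions available as `calc.add` and `calc.subtract`.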

In [None]:
####import calc.py file
import calc 
calc.add(1,2)

In [None]:
####renaming calc.py file
import calc as cal  
print(cal.add(10,2)) 

In [None]:
####import from module
from calc import subtract     
print(subtract(10,2))  

In [None]:
####import everything from module
from calc import *     
print(add(10,2))  
print(subtract(10,2))  

## Python Packages
- Consist of several modules
- Every package is a module
- Official repository is PyPI (https://pypi.org)

Ex: NumPy, pandas, Matplotlib, scikit-learn, SciPy

## Installing Packages
- To install an existing package
### Syntax: `!pip install <package name>`

In [None]:
##installing package gensim
!pip install gensim
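
After installing, it can be useful to confirm that a package is present. The sketch below uses the standard-library `importlib.metadata` module; the helper name `installed_version` is made up for this example, and the printed result depends on your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "gensim" is the package from the install example; output depends on your setup
print(installed_version("gensim"))
```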

## Data Visualization using python
### Matplotlib
- Basic library to visualize data
- Used to plot bar charts, scatter plots, and histograms

### Importing Matplotlib

In [None]:
# Either line imports pyplot under the alias plt; one of them is enough
import matplotlib.pyplot as plt
from matplotlib import pyplot as plt

### Pyplot

Pyplot is a collection of functions that make matplotlib work like MATLAB.
- pyplot has below functionalities:
    - Creating figure
    - Creating plotting area
    - Plotting lines
    - Labelling plots
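
The four functionalities listed above can be sketched in a few lines. The data values and label text are illustrative assumptions only.

```python
from matplotlib import pyplot as plt

plt.figure()                     # creating a figure
plt.subplot(1, 1, 1)             # creating a plotting area (axes) inside the figure
plt.plot([1, 2, 3], [2, 4, 6])   # plotting a line
plt.xlabel('x axis')             # labelling the plot
plt.ylabel('y axis')
plt.title('Example line')
```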

### Basic plots in Matplotlib

- Line plot
- Bar plot
- Histogram
- Scatterplot

### Line plots
- Used to show relationships between 2 numeric variables


In [None]:
from matplotlib import pyplot as plt #importing matplotlib module
#x axis values
x = [5, 7, 8, 9, 50, 60]
y = [20, 30, 40, 70, 80, 87] #y axis values
plt.plot(x,y ) #function

### Formatting line plot

In [None]:
from matplotlib import pyplot as plt #importing matplotlib module
x = [5, 7, 8, 9, 50, 60]
y = [20, 30, 40, 70, 80, 87] 
plt.ylabel('y axis') #adding y label
plt.xlabel('x axis') #adding x label
plt.plot(x,y ) #function to plot

### Changing axis range

In [None]:
from matplotlib import pyplot as plt #importing matplotlib module
x = [5, 7, 8, 9, 50, 60]
y = [20, 30, 40, 70, 80, 87] 
plt.ylabel('y axis') #adding y label
plt.xlabel('x axis') #adding x label
plt.axis([0,50, 0,70])
plt.plot(x,y ) #function to plot

### Bar plot
- Used to represent categorical data
- Length of each bar is proportional to the value (for example, the frequency) of the corresponding category

In [None]:
from matplotlib import pyplot as plt #importing matplotlib module
names= ['groupa', 'groupb', 'groupc'] #categories on x axis
values = [1, 10, 100] #values on y axis
plt.bar(names, values, color = 'green') #function to plot
plt.xlabel('Groups') #adding x label
plt.ylabel('Number of people') #adding y label
plt.title('people statistics') #add title
plt.show() #show plot

### Histograms
- Used to represent distribution of numerical data
- Each bar covers a contiguous interval of values called a bin
- Summarize the distribution
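
To see what a histogram summarizes, the bin counts can be computed directly with `np.histogram`. This is a sketch using the same data as this section's example and assuming 10 equal-width bins, which is also the default for `plt.hist`.

```python
import numpy as np

data = [1, 2, 50, 100, 30, 50, 200, 500]

# 10 equal-width bins between the smallest and largest value
counts, bin_edges = np.histogram(data, bins=10)

# Print each bin's interval and how many values fall into it
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```

Each printed row corresponds to one bar of the histogram.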

In [None]:
from matplotlib import pyplot as plt #importing matplotlib module
data= [1,2,50,100,30,50,200,500]
plt.hist(data)

### Scatterplot
- Used to observe and show relationships between 2 numeric variables

In [None]:
from matplotlib import pyplot as plt #importing matplotlib module
x = [5, 7, 8, 9, 50, 60]
y = [20, 30, 40, 70, 80, 87] 
plt.ylabel('y axis') #adding y label
plt.xlabel('x axis') #adding x label
plt.scatter(x,y ) #function to plot

### Adding multiple plots

In [None]:
from matplotlib import pyplot as plt      #importing matplotlib module
x1 = [5, 7, 8, 9, 50, 60]                                 #x- axis values
y1 = [20, 30, 40, 70, 80, 87]                                #y- axis values
plt.subplot(1, 2, 1)                                 #add subplot
plt.xlabel('x-coordinate')                               #adding x label
plt.ylabel('y- coordinate')                                #adding y label
plt.title('Line plot')                                  #add title
plt.plot(x1,y1)                                          #function to plot

x2 = [5, 7, 8, 9, 50, 60]                                   #x- axis values
y2 = [20, 30, 40, 70, 80, 87]                                #y- axis values
plt.subplot(1, 2, 2)                                 #add subplot
plt.xlabel('x-coordinate')                                #adding x label
plt.ylabel('y- coordinate')
plt.title('Scatter plot')                                  #add title
plt.scatter(x2,y2)      

## Getting started with Numpy


#### More details about NumPy can be found in the [NumPy tutorial](https://numpy.org/doc/stable/user/absolute_beginners.html)

In [None]:
###if numpy is not installed use below command first to install 
#!pip3 install numpy

In [None]:
###Importing numpy package with alias as np
import numpy as np

### Creating Numpy Arrays

In [None]:
###creating a numpy array height
height = [1.73, 1.68, 1.71, 1.89, 1.79]
np_height = np.array(height)
np_height

In [None]:
###creating a numpy array weight
weight = [65.4, 59.2, 63.6, 88.4, 68.7]
np_weight = np.array(weight)
np_weight

In [None]:
###compute bmi
bmi = np_weight /np_height ** 2
bmi

### Major functions to create numpy arrays quickly as a sequence of elements

In [None]:
### creating numpy arrays with zeroes with 3 elements
np.zeros(3)

In [None]:
### creating numpy arrays with ones with 3 elements
np.ones(3)

In [None]:
####create an array with a range of elements
np.arange(4)

In [None]:
####create an array with a range of elements and a step size
np.arange(2, 9, 2)

In [None]:
###Sorting a numpy array (returns a sorted copy; np_height itself is unchanged)
np.sort(np_height)

### Knowing the shape and size of numpy array

In [None]:
###dimensions of height array
np_height.ndim

In [None]:
###size of height array which represents total number of elements
np_height.size

In [None]:
###shape of height array
###since it is one-dimensional array so the shape only contains one number
np_height.shape

In [None]:
###creating an array from values of different data types
np.array([1.0, "is", True])

### A NumPy array holds a single data type, so each element here is converted to a string
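
The conversion can be checked by inspecting the array's `dtype`. A small sketch, reusing the mixed values from the cell above:

```python
import numpy as np

mixed = np.array([1.0, "is", True])

print(mixed)         # every element is now a string
print(mixed.dtype)   # a Unicode string dtype (the exact width may vary)
```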

#### Numpy + operator

In [None]:
python_list = [1,2,3]
python_list + python_list

In [None]:
np_list = np.array([1,2,3])
np_list + np_list

### Numpy Subsetting

In [None]:
bmi

In [None]:
####extracting single element from bmi array
print(bmi[0])
print(bmi[1])

In [None]:
####extracting multiple elements from bmi array
###Extracting second to fourth element 
bmi[1:4]

In [None]:
### Extracting first 3 elements
bmi[:3]

In [None]:
##Extracting from second element until last
bmi[1:]

In [None]:
###Extract first, third and fifth element 
bmi[0:5:2]

In [None]:
###Logical conditions on a numpy array
###The comparison returns a boolean mask (True where bmi is greater than 23)
bmi > 23

In [None]:
###Extracting elements with logical conditions 
###Extract all bmi's greater than 23
bmi[bmi > 23]

In [None]:
###Extract all bmi's greater than 21 and less than 23
bmi[(bmi > 21) & (bmi < 23)]

### 2D numpy array

In [None]:
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79], [65.4, 59.2, 63.6, 88.4, 68.7]])
np_2d

### Subsetting 2d numpy array


In [None]:
###extracting first array
np_2d[0]

In [None]:
##extracting first element of first array
np_2d[0][0]

In [None]:
###Extracting element at row zero and column 2
np_2d[0][2]

In [None]:
###Extracting element at row zero and column 2
np_2d[0,2]

In [None]:
###Extracting elements from all rows and second and third column
np_2d[:,1:3]

In [None]:
###Extracting second row
np_2d[1,:]

In [None]:
###Extracting second column
np_2d[:,1]

In [None]:
###Extract all elements with value > 1.6 
np_2d[np_2d > 1.6]

## Getting started with Pandas

#### More details about pandas can be found in the [pandas tutorial](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro)

In [None]:
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import numpy as np

## Introduction to pandas Data Structures


## Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

### Creating Series

In [None]:
obj = pd.Series([4, 7, -5, 3])
obj

In [None]:
obj.values

In [None]:
obj.index  # like range(4)

### Changing Indices

In [None]:
###changed the index
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

In [None]:
obj2.index

### Access values by index

In [None]:
##value at index a
obj2['a']

In [None]:
obj2[['c', 'a', 'd']]

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

In [None]:
obj3["Ohio"]

In [None]:
obj3[["Ohio", "Texas"]]

In [None]:
###check missing values
pd.isnull(obj3)

## Data Frame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
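
The "dict of Series" view can be made concrete with a short sketch. The numbers below are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration
population = pd.Series({'Ohio': 11.8, 'Texas': 29.1})
area = pd.Series({'Ohio': 44.8, 'Texas': 268.6})

# Each Series becomes one column; the shared index labels become the rows
states = pd.DataFrame({'population': population, 'area': area})
print(states)
print(states['area'])   # selecting a column gives back a Series
```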

### Creating Data frames

In [None]:
###Creating Data frame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

In [None]:
###glimpse of data
frame.head()

In [None]:
###changing column order
pd.DataFrame(data, columns=['year', 'state', 'pop'])

In [None]:
###adding index to rows
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
frame2

### Accessing values by column names

In [None]:
frame2

In [None]:
frame2['state']

In [None]:
frame2['year']

### Access columns by index

In [None]:
frame2

In [None]:
###select the second column by position (positions start at 0)
frame2.iloc[:,1]

In [None]:
###select second and third columns
frame2.iloc[:,[1,2]]

In [None]:
###select first, 2nd and 3rd columns
frame2.iloc[:,0:3]

### Select columns based on label indexing

In [None]:
frame2.loc[:,'year']

In [None]:
frame2.loc[:,['year','state']]

### Selecting rows

### Selecting rows by row index names

In [None]:
frame2

In [None]:
frame2.loc['one']

In [None]:
frame2.loc[['one','two']]

### Select rows by index

In [None]:
frame2.iloc[1]

In [None]:
frame2.iloc[1,:]

In [None]:
###select multiple rows by index
frame2.iloc[1:3]

### Selecting rows alternative

In [None]:
###Selecting first row
frame2[:1]

In [None]:
###Returning all rows except the first row
frame2[1:]

In [None]:
###Returning everything
frame2[:]

In [None]:
frame2[1:3]

### Selecting Values by rows and column

In [None]:
frame2

In [None]:
###select third row and third column (positions start at 0)
frame2.iloc[2,2]

### Before moving to the case study, explore how the concepts learned in basic Python programming can be applied to data analysis

### Use of loops for data analysis
In Python data analysis, loops play a pivotal role. They provide a practical means to handle repetitive tasks, enabling us to iterate over data structures, apply functions, and manipulate data efficiently. Let’s dive into some key reasons why loops are indispensable in Python data analysis:

- <b> Efficient Data Manipulation </b>: Loops offer a simple and efficient way to perform repeated operations on data. For instance, you might need to apply a certain transformation to every element in a list or each row in a DataFrame. With loops, such tasks can be accomplished effortlessly.

- <b> Data Cleaning and Preprocessing </b>: Data in real-world scenarios is often messy and requires substantial cleaning and preprocessing. Loops come in handy for performing these tasks, like filling missing values, transforming data types, or normalizing data across various columns.

- <b> Feature Engineering </b>: Feature engineering, an essential step in preparing data for machine learning models, often involves creating new features based on existing ones. Loops facilitate such operations with ease.

- <b> Exploratory Data Analysis (EDA) </b>: Loops can be used to generate multiple plots for different data segments, thereby providing valuable insights during EDA.

- <b> Model Training and Evaluation </b>: Loops enable us to train multiple models, tune hyperparameters, or perform cross-validation systematically, thereby enhancing the model development process.

### To illustrate, consider the following code snippet that uses a for loop to calculate the mean of each column in a DataFrame:



In [None]:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 15, 10, 20, 15]})
df

In [None]:
column_means = {}

for column in df.columns:
    column_means[column] = df[column].mean()

print(column_means)

This prints a dictionary with the mean of each column: {'A': 3.0, 'B': 13.0}. A short loop like this replaces repeating the same calculation by hand for every column.
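
For comparison, pandas can compute the same result without an explicit loop. A minimal sketch using the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 15, 10, 20, 15]})

# DataFrame.mean() computes the mean of every column in one call
column_means = df.mean().to_dict()
print(column_means)
```

The loop version remains useful when each column needs different handling.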

In [None]:
# Create list of average monthly precip (inches) in Boulder, CO
avg_monthly_precip_in = [0.70,  0.75, 1.85, 2.93, 3.05, 2.02, 
                         1.93, 1.62, 1.84, 1.31, 1.39, 0.84]

In [None]:
# Convert each item in list from inches to mm and print it (the list itself is not modified)
for month in avg_monthly_precip_in:
    month *= 25.4
    print(month)
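
The loop above only prints the converted values. To keep them for later use, one option is to build a new list; the name `avg_monthly_precip_mm` and the rounding to two decimals are choices made for this sketch.

```python
avg_monthly_precip_in = [0.70, 0.75, 1.85, 2.93, 3.05, 2.02,
                         1.93, 1.62, 1.84, 1.31, 1.39, 0.84]

# Build a new list of values in mm (1 inch = 25.4 mm), rounded to 2 decimals
avg_monthly_precip_mm = [round(value * 25.4, 2) for value in avg_monthly_precip_in]
print(avg_monthly_precip_mm)
```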

### Real World Examples of Python Loops in Data Analysis


1. <b> Data Cleaning </b>: Suppose you have a dataset where some numbers are stored as strings, and you need to convert them to integers. A Python loop can iterate over the dataset and perform the necessary conversion.

In [None]:
frame2

In [None]:
frame2['new'] = ['2','4','6','8','9','4']

In [None]:
frame2.info()

In [None]:
frame2

In [None]:
frame2.iloc[1,4]

In [None]:
for i in range(len(frame2)):
    if isinstance(frame2.iloc[i,4], str):
        frame2.iloc[i,4] = int(frame2.iloc[i,4])

In [None]:
type(frame2.iloc[1,4])
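
The row-by-row loop converts each value, but the column as a whole typically keeps the generic `object` dtype. As an alternative sketch, the whole column can be converted in one step with `astype`; the small DataFrame `demo` is a stand-in for `frame2` and is an assumption of this example.

```python
import pandas as pd

# Small stand-in for frame2, with numbers stored as strings
demo = pd.DataFrame({'new': ['2', '4', '6', '8', '9', '4']})

# Convert the whole column at once instead of row by row
demo['new'] = demo['new'].astype(int)
print(demo.dtypes)
```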

2. <b> Feature Extraction </b>: In text analysis, you often need to calculate certain features from the text data, like the length of each document. Loops can be used to perform such operations.
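
A minimal sketch of this idea, assuming "length" means the number of characters and the number of whitespace-separated words. The documents are made up for illustration.

```python
# Hypothetical documents, used only for illustration
documents = [
    "Loops repeat a block of code",
    "pandas makes data analysis easier",
    "NumPy arrays",
]

doc_lengths = []
for doc in documents:
    n_chars = len(doc)            # length in characters
    n_words = len(doc.split())    # length in whitespace-separated words
    doc_lengths.append((n_chars, n_words))

print(doc_lengths)
```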

3. <b> Aggregating Data </b>: If you need to aggregate data based on certain criteria, loops can help. For instance, calculating the sum and median of monthly precipitation for each year stored in a list of arrays:

In [None]:
# Import necessary packages
import numpy as np

# Array of average monthly precip (inches) for 2002 in Boulder, CO
precip_2002_arr = np.array([1.07, 0.44, 1.50, 0.20, 3.20, 1.18, 
                            0.09, 1.44, 1.52, 2.44, 0.78, 0.02])

# Array of average monthly precip (inches) for 2013 in Boulder, CO
precip_2013_arr = np.array([0.27, 1.13, 1.72, 4.14, 2.66, 0.61, 
                            1.03, 1.40, 18.16, 2.24, 0.29, 0.50])

In [None]:
# Create list of numpy arrays
arr_list = [precip_2002_arr, precip_2013_arr]

In [None]:
# Calculate sum and median for each numpy array in list
for arr in arr_list:    
    arr_sum = np.sum(arr)
    print("sum:", arr_sum)
    
    arr_median = np.median(arr)    
    print("median:", arr_median)    

4. <b> Creating Visualizations </b>: When performing exploratory data analysis, you might need to create multiple plots to understand your data better. For instance, creating a histogram for each numerical column in a DataFrame:

In [None]:
frame2.columns

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Plot a histogram for each numeric column of frame2
for column in frame2.select_dtypes(include='number').columns:
    plt.figure()
    frame2[column].hist()
    plt.title(column)

### Use of functions for data analysis
- Functions are widely used to create reusable portions of code during data analysis
- We can create functions for EDA
    - Missing values
    - changing data types
    - For data visualization
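
As one illustration of a reusable EDA function, the sketch below summarizes missing values per column. The function name `missing_summary` and the sample data are hypothetical, chosen for this example.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values for each column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    return pd.DataFrame({'missing': counts, 'percent': percent})


# Hypothetical data, used only to show the function in action
sample = pd.DataFrame({'age': [22.0, np.nan, 35.0, np.nan],
                       'fare': [7.25, 71.28, 8.05, 53.10]})
print(missing_summary(sample))
```

The same function can then be called on any DataFrame, which avoids rewriting the check for every dataset.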

## Case Study

## Read the dataset
- Download from blackboard

In [None]:
titanic = pd.read_csv("titanic.csv")

### Getting the glimpse of Data 

In [None]:
####glimpse of data
titanic.head()

In [None]:
###print first 8 rows of dataset
titanic.head(8)

In [None]:
###print last 5 rows of dataset
titanic.tail()

## Getting detailed summary of dataframe

In [None]:
## summary of dataframe
titanic.info()

In [None]:
###Data types of each column
titanic.dtypes

## Changing the type of columns as required

In [None]:
###Changing the type of column
titanic["Sex"] = titanic["Sex"].astype("category")
titanic["Sex"].dtype

In [None]:
## Converting columns 1, 2, 4 and 11 to categorical
## Assign by column name: assigning through iloc may keep the original dtype
for i in [1,2,4,11]:
    col = titanic.columns[i]
    titanic[col] = titanic[col].astype("category")

In [None]:
titanic.dtypes

## Subsetting data


### Selecting specific columns from data


In [None]:
titanic.head()

In [None]:
###select a subset of dataframe
ages = titanic["Age"]
ages

In [None]:
###Extracting age column by index
ages = titanic.iloc[:,5]
ages

### Selecting multiple columns

In [None]:
### Selecting columns by index in a sequence (selecting first 3 columns)
titanic.iloc[:,0:3]

In [None]:
### Selecting columns by index in a sequence (selecting 2nd to 5th column)
titanic.iloc[:,1:5]

### Selecting rows from data

In [None]:
###select first row from data
titanic.loc[0]

### Selecting multiple rows from data

In [None]:
###select first three rows from data
titanic.loc[0:2]

In [None]:
###select first three rows from data
titanic.iloc[:3]

In [None]:
###select first three rows from data
titanic.iloc[0:3,:]

In [None]:
###selecting rows with index labels 3 and 5 (the 4th and 6th rows)
titanic.loc[[3,5]]

In [None]:
###selecting rows at positions 3 and 5 (the 4th and 6th rows)
titanic.iloc[[3,5],:]

## Selecting rows by condition

In [None]:
###select all passengers with age > 35
above_35 = titanic[titanic["Age"] > 35]
above_35

In [None]:
####Selecting all passengers with class as 2 or 3
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
class_23

In [None]:
###Select all survived passengers in class 2
survived_pass = titanic[(titanic["Survived"] == 1) & (titanic["Pclass"] == 2)]
survived_pass

### Selecting both rows and columns

In [None]:
###filtering both rows and columns
###selecting rows at positions 9 to 24 (10th to 25th rows) and columns at positions 2 to 4 (3rd to 5th columns)
titanic.iloc[9:25, 2:5]

In [None]:
###select the names of all passengers above age 35
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
adult_names

## Summary statistics for Data

In [None]:
###Mean age of passengers
titanic["Age"].mean()

In [None]:
###Median age and fare
titanic[["Age", "Fare"]].median()

In [None]:
###Detailed summary statistics
titanic[["Age", "Fare"]].describe()

### Aggregating statistics (usually for categorical variables)

### Aggregating for single variable/column

In [None]:
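### Compute mean of every numerical variable by Sex
### numeric_only=True skips text columns such as Name (newer pandas versions raise an error otherwise)
titanic.groupby("Sex").mean(numeric_only=True)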

In [None]:
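### Compute mean of every numerical variable by Pclass
### numeric_only=True skips text columns such as Name (newer pandas versions raise an error otherwise)
titanic.groupby("Pclass").mean(numeric_only=True)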

### Aggregating by multiple columns

In [None]:
### Compute mean age of passengers grouped by class
titanic[["Age", "Pclass"]].groupby("Pclass").mean()

In [None]:
### Compute mean age of passengers grouped by sex
titanic[["Age", "Sex"]].groupby("Sex").mean()

### What is the mean ticket fare price for each of the sex and cabin class combinations?

In [None]:
titanic[["Fare", "Sex", "Pclass"]].groupby(["Sex", "Pclass"]).mean()

### Count the values of categorical variables


In [None]:
### Number of passengers in each class
titanic["Pclass"].value_counts()

In [None]:
### Number of males and females
titanic["Sex"].value_counts()

In [None]:
### Number of survived and not survived
titanic["Survived"].value_counts()

In [None]:
###Number of passengers for each combination of Survived and Sex
titanic[["Survived", "Sex"]].value_counts()

## Data Preprocessing

In [None]:
###check for missing values in data
titanic.isnull().sum()

In [None]:
###check for missing values in specific column
titanic["Age"].isnull().sum()

## Data Visualization

In [None]:
###Plot specific column
titanic["Fare"].plot()

In [None]:
import matplotlib.pyplot as plt
plt.bar(titanic['Sex'], titanic['Age'])

In [None]:
plt.bar(titanic['Sex'], titanic['Survived'])

In [None]:
plt.hist(titanic['Age'])
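
Passing raw columns to `plt.bar` draws one bar per passenger, stacked on top of each other at each category. If the goal is one bar per category, a possible approach is to aggregate first and then plot. The sketch below uses a small made-up DataFrame with the same column names as `titanic`, so it runs without the dataset.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical stand-in for the titanic DataFrame (same column names, made-up rows)
sample = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                       'Age': [22.0, 38.0, 26.0, 35.0]})

# One bar per category: the mean age for each sex
mean_age = sample.groupby('Sex')['Age'].mean()

plt.bar(list(mean_age.index), mean_age.values)
plt.xlabel('Sex')
plt.ylabel('Mean age')
plt.title('Mean age by sex')
```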