# Python Libraries
- A Python package is a collection of Python modules, a Python library is pre-written code that you can reuse in your own programs, and pip is the package manager used to install and manage both.
1. **Python Package:**
   - A Python package is a collection of Python modules organized in a directory structure.
   - Packages allow you to group related Python modules together, making it easier to manage and distribute your code.
   - Packages help in organizing your code and promoting code reuse across different projects.
   - When you install a Python package, it typically includes multiple modules that you can import and use in your Python scripts.

2. **Python Libraries:**
   - Python libraries are collections of pre-written code that you can use in your own Python programs.
   - Some of the popular Python libraries include:
     - NumPy: A library for working with arrays, matrices, and mathematical functions.
     - Pandas: A library for data manipulation and analysis, providing data structures like DataFrames and Series.
     - Matplotlib: A plotting library for creating static, animated, and interactive visualizations in Python.
     - Seaborn: A data visualization library built on top of Matplotlib, providing a high-level interface for drawing attractive and informative statistical graphics.
   - These libraries provide a wide range of functionality, allowing you to focus on solving your specific problem instead of reinventing the wheel.
   - By using these libraries, you can save time and effort, as they provide tested and optimized implementations of common tasks and algorithms.

3. **pip (Pip Installs Packages):**
   - pip is the standard package manager for Python.
   - It is used to install, upgrade, and remove Python packages and their dependencies.
   - When you need to use a Python library or package in your project, you can install it using pip.
   - The command to install a package using pip is: 
     ```
     pip install package_name
     ```
   - pip makes it easy to manage and distribute Python packages, ensuring that you have the necessary dependencies installed for your project to run smoothly.
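
Before installing anything, it can help to check whether a module is already importable. The snippet below is a minimal sketch that uses only the standard library; the helper name `is_installed` and the made-up package name are assumptions for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# 'json' ships with Python, so it is always importable.
print(is_installed("json"))  # True

# A made-up name that should not exist on any system (hypothetical).
print(is_installed("surely_not_a_real_package_xyz"))  # False
```

If the check returns False for a library you need, `pip install package_name` is the next step.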

# NumPy

In [1]:
import numpy as np

In [2]:
list1 = [1, 2, 3, 4, 5, 6, 7]

arr = np.array(list1) # np.array() is a NumPy function that builds an array from a scalar or a (nested) sequence such as a list
print(arr)

[1 2 3 4 5 6 7]


In [3]:
type(arr)

numpy.ndarray

# Shape, Attribute, and Method:
- shape: it gives the size of an array or DataFrame along each dimension, e.g. (rows, columns) for a 2-D object.
- attribute: it holds metadata about an object or variable and is accessed without parentheses, e.g. .shape, .dtypes, .columns, .index, etc.
- method: it is a function associated with an object and is called with parentheses, e.g. .append(), .index(), .capitalize(), .replace(), etc.
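
The difference between an attribute and a method can be sketched on a small array. The array name `demo` is an assumption for illustration only.

```python
import numpy as np

demo = np.array([[1, 2, 3], [4, 5, 6]])  # illustrative array

# Attributes: stored information, accessed without parentheses.
print(demo.shape)  # (2, 3)
print(demo.ndim)   # 2

# Methods: functions bound to the object, called with parentheses.
print(demo.sum())  # 21
print(demo.max())  # 6

# callable() tells the two apart: a method is callable, an attribute value is not.
print(callable(demo.shape), callable(demo.sum))  # False True
```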

In [4]:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr1)

[[1 2 3]
 [4 5 6]]


In [5]:
type(arr1)

numpy.ndarray

In [6]:
# confirm the shape
arr1.shape

(2, 3)

In [7]:
arr.shape

(7,)

In [8]:
# .size: the total number of elements (rows x columns for a 2-D array)
arr1.size

6

In [9]:
arr.size

7
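
More generally, `.size` equals the product of all entries in `.shape`, whatever the number of dimensions. A minimal sketch, using a hypothetical 3-D array named `cube`:

```python
import numpy as np

cube = np.zeros((2, 3, 4))  # hypothetical 3-D array filled with zeros

print(cube.ndim)   # 3
print(cube.shape)  # (2, 3, 4)
print(cube.size)   # 24

# size is the product of the shape entries: 2 * 3 * 4
print(cube.size == int(np.prod(cube.shape)))  # True
```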

In [10]:
# np.linspace(start, stop, num): a NumPy function that creates an array of num evenly spaced numbers from start to stop (both included).

np.linspace(0, 100, 6)

array([  0.,  20.,  40.,  60.,  80., 100.])
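
Two optional arguments of `np.linspace()` are worth knowing: `retstep` and `endpoint`. The sketch below also contrasts it with `np.arange()`, which takes a step size instead of a count; the variable names are assumptions for illustration.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive values.
points, step = np.linspace(0, 100, 6, retstep=True)
print(step)  # 20.0

# endpoint=False leaves the stop value out of the result.
open_ended = np.linspace(0, 100, 5, endpoint=False)
print(open_ended.tolist())  # [0.0, 20.0, 40.0, 60.0, 80.0]

# np.arange(start, stop, step) is defined by the step and always excludes stop.
stepped = np.arange(0, 100, 20)
print(stepped.tolist())  # [0, 20, 40, 60, 80]
```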

In [11]:
# Generating two arrays using linspace.
array1 = np.linspace(1, 10, 10)
array1

array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])

In [12]:
array2 = np.linspace(11, 20, 10)
array2

array([11., 12., 13., 14., 15., 16., 17., 18., 19., 20.])

In [13]:
# adding the two arrays element-wise
result_addition = array1 + array2
print(result_addition)

[12. 14. 16. 18. 20. 22. 24. 26. 28. 30.]


In [14]:
# subtracting the two arrays element-wise
result_subtract = array2 - array1
print(result_subtract)

[10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]


In [15]:
# multiplying the two arrays element-wise
result_product = array1 * array2
print(result_product)

[ 11.  24.  39.  56.  75.  96. 119. 144. 171. 200.]
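
Element-wise arithmetic is one of the main reasons to use arrays instead of plain lists. As a small sketch (the names `a_list`, `b_list`, `a_arr`, and `b_arr` are illustrative), `+` on lists concatenates, while `+` on arrays adds matching elements:

```python
import numpy as np

a_list = [1, 2, 3]
b_list = [10, 20, 30]

# With plain lists, + joins the two lists end to end.
print(a_list + b_list)  # [1, 2, 3, 10, 20, 30]

a_arr = np.array(a_list)
b_arr = np.array(b_list)

# With arrays, arithmetic is applied element by element.
print(a_arr + b_arr)  # [11 22 33]
print(b_arr / a_arr)  # [10. 10. 10.]

# A single number is applied to every element.
print(a_arr * 2)      # [2 4 6]
```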


# Random 
- **We can randomly generate numbers using numpy**

np.random is a NumPy module whose functions, such as rand(), randn(), and randint(), generate random numbers.

1. rand(): it returns random floats drawn uniformly from the interval [0, 1), so 1 itself is never returned.
2. randn(): it creates an array of the specified shape and fills it with random values drawn from the standard normal distribution (mean 0, standard deviation 1).
3. randint(): it returns random integers from the specified range; the lower bound is included and the upper bound is excluded.

In [16]:
np.random.rand(10) # ==> 10 float numbers between 0 and 1

array([0.83835488, 0.76704524, 0.11698191, 0.96207912, 0.78922517,
       0.51555846, 0.78558096, 0.08776907, 0.9686308 , 0.5548911 ])

In [17]:
np.random.randn(10) # ==> 10 numbers drawn from the standard normal distribution

array([-0.83587582, -1.78307044,  2.15633562, -1.3030549 ,  0.76997243,
       -0.63395337,  1.15690521,  1.12610155, -1.45217809,  0.56881951])

In [18]:
np.random.randn(10, 5)

array([[ 0.39928388, -1.16606504,  1.16979824, -1.17134687,  0.34974722],
       [-0.72488792,  1.40678679, -1.40285592,  0.95511291, -0.73648343],
       [ 0.96958969,  0.67407221, -1.60223188, -0.53723079, -0.77876365],
       [ 0.65199849, -0.82152914, -1.02073439,  0.17189349, -0.72992158],
       [-1.48588873, -1.21569173,  0.11979128, -0.6225854 , -0.81984021],
       [-0.68795152, -0.83860015,  0.32593607, -0.55586883,  0.15597766],
       [-0.76439344, -0.03955543,  1.65884424, -0.29785051, -0.24031881],
       [-2.02199937,  0.07152361, -0.20469595, -0.67660906,  1.69288321],
       [ 1.21999221,  0.20420898,  0.42225762, -0.74397204, -0.22737031],
       [-0.94434428,  0.73709916,  1.09959204, -0.31633197, -0.28856125]])

In [19]:
np.random.randint(10) # ==> generates one integer from 0 to 9 (the upper bound, 10, is excluded)

1

In [20]:
np.random.randn(10, 5).shape

(10, 5)
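
Because the outputs are random, they change on every run. The sketch below checks the value ranges of each function and shows how a seed makes results repeatable; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed so the run is repeatable

uniform = np.random.rand(1000)           # floats in [0, 1)
ints = np.random.randint(10, size=1000)  # integers 0..9 (10 is excluded)
between = np.random.randint(5, 11, size=1000)  # integers 5..10 (11 is excluded)

print(uniform.min() >= 0.0, uniform.max() < 1.0)  # True True
print(ints.min() >= 0, ints.max() <= 9)           # True True
print(between.min() >= 5, between.max() <= 10)    # True True

# Re-seeding with the same value reproduces the same numbers.
np.random.seed(42)
first = np.random.rand(3)
np.random.seed(42)
second = np.random.rand(3)
print(np.array_equal(first, second))  # True
```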

# Indexing and slicing in NumPy

In [21]:
print(arr1)

[[1 2 3]
 [4 5 6]]


In [22]:
arr1[0] # a single index inside [] selects a row

array([1, 2, 3])

In [23]:
arr1[1]

array([4, 5, 6])

In [24]:
arr1[1:,]

array([[4, 5, 6]])

In [25]:
arr1[:,1] # [:, value] selects a column: all rows, column index 1

array([2, 5])

In [26]:
arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [5, 3, 8]])
print(arr2)

[[1 2 3]
 [4 5 6]
 [7 8 9]
 [5 3 8]]


In [27]:
# shape
arr2.shape

(4, 3)

In [28]:
# arr2[row_slice, column_slice]: here, rows from index 2 onward and all columns
arr2[2:, ]

array([[7, 8, 9],
       [5, 3, 8]])

In [29]:
arr2[:, 2]

array([3, 6, 9, 8])

In [30]:
arr2[1:, 2]

array([6, 9, 8])

In [31]:
arr2[2:, 1]

array([8, 3])
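
Row and column slices can be combined to cut out a block, and negative indices count from the end. A short sketch on a copy of the same values (the name `grid` is illustrative):

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [5, 3, 8]])

# Rows 1 and 2 (stop index 3 is excluded), columns 0 and 1.
print(grid[1:3, 0:2].tolist())  # [[4, 5], [7, 8]]

# A single element: row 2, column 1.
print(grid[2, 1])  # 8

# Negative indices count from the end: last row, then last column.
print(grid[-1].tolist())     # [5, 3, 8]
print(grid[:, -1].tolist())  # [3, 6, 9, 8]
```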

# Descriptive Statistics
- Measures of central tendency (mean, median, and mode)
- Measures of dispersion or variability (range, variance, standard deviation, interquartile range)

## Measure of Central Tendency

In [32]:
# mean
print(arr2)

[[1 2 3]
 [4 5 6]
 [7 8 9]
 [5 3 8]]


In [33]:
np.mean(arr2)

5.083333333333333

In [34]:
# median
np.median(arr2)

5.0

In [35]:
# NumPy has no mode function, so this call raises an error
np.mode(arr2)

AttributeError: module 'numpy' has no attribute 'mode'

In [36]:
# count how many times each distinct value occurs
values, counts = np.unique(arr2, return_counts = True)
print(counts)

[1 1 2 1 2 1 1 2 1]


In [37]:
# the value with the highest count; on a tie, np.argmax picks the first one
mode = values[np.argmax(counts)]
print(mode)

3
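
In this array, 3, 5, and 8 each occur twice, so the data has three modes and `np.argmax` reports only the first. A minimal sketch that returns every tied value (the names `data` and `modes` are illustrative):

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [5, 3, 8]])

values, counts = np.unique(data, return_counts=True)

# Keep every value whose count equals the highest count.
modes = values[counts == counts.max()]
print(modes.tolist())  # [3, 5, 8]
```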


## Measure of Variability

In [38]:
print(arr2)

[[1 2 3]
 [4 5 6]
 [7 8 9]
 [5 3 8]]


In [39]:
# range of arr2
np.ptp(arr2)

8

In [40]:
# variance of arr2
np.var(arr2)

6.076388888888888

In [41]:
# standard deviation of arr2 (dtype = int returns an integer; without it the value is about 2.465)
np.std(arr2, dtype = int)

2
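
The measures of variability are related to one another, and the interquartile range listed earlier can be computed with `np.percentile()`. The sketch below works on the same values; the variable names are assumptions for illustration.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [5, 3, 8]])

variance = np.var(data)
std = np.std(data)  # no dtype argument, so the result keeps its decimals

# The standard deviation is the square root of the variance.
print(bool(np.isclose(std, np.sqrt(variance))))  # True
print(round(float(std), 3))                      # 2.465

# Interquartile range: 75th percentile minus 25th percentile.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(float(iqr))  # 4.25

# ddof=1 gives the sample standard deviation (divides by n - 1 instead of n).
sample_std = np.std(data, ddof=1)
print(bool(sample_std > std))  # True
```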

# Pandas
- Series ==> 1 dimension

- DataFrame ===> 2 dimension

In [42]:
import pandas as pd

In [43]:
# pd.Series(data, index) ===> one-dimensional data with an index of labels

# pd.DataFrame(data, index, columns) ===> two-dimensional data with row labels (index) and column labels

In [44]:
age = [4, 11, 20, 34, 80, 16, 20]

age_series = pd.Series(age)
print(age_series)

0     4
1    11
2    20
3    34
4    80
5    16
6    20
dtype: int64


In [45]:
type(age_series)

pandas.core.series.Series

In [46]:
car = ['Toyota', 'Lexus', 'Honda', 'Range Rover']

series_car = pd.Series(car)
print(series_car)

0         Toyota
1          Lexus
2          Honda
3    Range Rover
dtype: object


In [47]:
type(series_car)

pandas.core.series.Series

## Simple operations with Series
- addition, subtraction, and multiplication

`Arithmetic between two series is element-wise and aligned on the index labels, so the series should share the same labels; a label found in only one series produces NaN.`

In [48]:
# addition: it is the element-wise addition of series
s1 = pd.Series([3, 4, 13, 19], index = ['A', 'B', 'C', 'D'])
s2 = pd.Series([7, 10, 15, 24], index = ['A', 'B', 'C', 'D'])

result = s1 + s2
print(result)

A    10
B    14
C    28
D    43
dtype: int64


In [49]:
# subtraction: it is the element-wise difference of series
s1 = pd.Series([3, 4, 13, 19], index = ['A', 'B', 'C', 'D'])
s2 = pd.Series([7, 10, 15, 24], index = ['A', 'B', 'C', 'D'])

result = s2 - s1
print(result)

A    4
B    6
C    2
D    5
dtype: int64


In [50]:
# multiplication: it is the element-wise product of series
price = pd.Series([3.1, 4.0, 13.5, 19.2], index = ['A', 'B', 'C', 'D'])
quantity = pd.Series([7, 10, 15, 24], index = ['A', 'B', 'C', 'D'])

revenue = price * quantity
print(revenue)

A     21.7
B     40.0
C    202.5
D    460.8
dtype: float64
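
When the index labels do not match exactly, pandas still aligns on labels rather than on position. A minimal sketch with two hypothetical series, `left` and `right`:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
right = pd.Series([10, 20, 30], index=['B', 'C', 'D'])

# Only 'B' and 'C' exist in both series; 'A' and 'D' become NaN.
total = left + right
print(total['B'])                # 12.0
print(int(total.isna().sum()))   # 2

# .add() with fill_value treats a missing label as 0 instead.
filled = left.add(right, fill_value=0)
print(filled.tolist())  # [1.0, 12.0, 23.0, 30.0]
```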


# Concatenation: 
- We can also combine two series using the pd.concat() function. With axis = 1, the series are placed side by side as the columns of a new DataFrame.

In [51]:
price = pd.Series([3.1, 4.0, 13.5, 19.2], index = ['A', 'B', 'C', 'D'])
quantity = pd.Series([7, 10, 15, 24], index = ['A', 'B', 'C', 'D'])

result = pd.concat([price, quantity], axis = 1)
print(result)

      0   1
A   3.1   7
B   4.0  10
C  13.5  15
D  19.2  24
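
The columns above are named 0 and 1 because the series have no names. As a sketch, the `keys` argument of `pd.concat()` can label the columns directly, and leaving out `axis = 1` stacks the series end to end instead; the variable names are illustrative.

```python
import pandas as pd

price = pd.Series([3.1, 4.0, 13.5, 19.2], index=['A', 'B', 'C', 'D'])
quantity = pd.Series([7, 10, 15, 24], index=['A', 'B', 'C', 'D'])

# keys supplies the column labels when concatenating along axis 1.
side_by_side = pd.concat([price, quantity], axis=1, keys=['price', 'quantity'])
print(side_by_side.columns.tolist())  # ['price', 'quantity']
print(side_by_side.shape)             # (4, 2)

# The default axis (0) stacks the series into one longer Series.
stacked = pd.concat([price, quantity])
print(type(stacked).__name__, len(stacked))  # Series 8
```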


## Indexing and slicing

In [52]:
series_car

0         Toyota
1          Lexus
2          Honda
3    Range Rover
dtype: object

In [53]:
series_car[2]

'Honda'

In [54]:
series_car[:2]

0    Toyota
1     Lexus
dtype: object

In [55]:
series_car[2:]

2          Honda
3    Range Rover
dtype: object

In [56]:
# .values is used to extract the data or values in a series
print(series_car.values)

['Toyota' 'Lexus' 'Honda' 'Range Rover']


In [57]:
print(revenue.values)

[ 21.7  40.  202.5 460.8]


In [58]:
# .index returns the index labels of the series
print(series_car.index)

RangeIndex(start=0, stop=4, step=1)


In [59]:
print(revenue.index)

Index(['A', 'B', 'C', 'D'], dtype='object')
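
For a series with text labels, such as revenue, it helps to be explicit about whether a lookup is by label or by position. A minimal sketch using `.loc[]` (labels) and `.iloc[]` (positions) on a copy of the revenue values:

```python
import pandas as pd

revenue = pd.Series([21.7, 40.0, 202.5, 460.8], index=['A', 'B', 'C', 'D'])

# Same element, reached by label and by position.
print(revenue.loc['C'])  # 202.5
print(revenue.iloc[2])   # 202.5

# Label slices include the end label; position slices exclude the end position.
print(revenue.loc['B':'C'].tolist())  # [40.0, 202.5]
print(revenue.iloc[1:2].tolist())     # [40.0]
```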


# Aggregate Functions
- They are used to perform simple statistical operations on a Series or DataFrame.

`sum(), mean(), min(), max(), and count()`


In [60]:
# .sum()
print(revenue)

A     21.7
B     40.0
C    202.5
D    460.8
dtype: float64


In [61]:
total_revenue = revenue.sum()
print(total_revenue)

725.0


In [62]:
# .mean()
avg_revenue = revenue.mean()
print(avg_revenue)

181.25


In [63]:
# .min()
min_revenue = revenue.min()
print(min_revenue)

21.7


In [64]:
# .max()
max_revenue = revenue.max()
print(max_revenue)

460.79999999999995


In [65]:
# .count()
count_series_car = series_car.count()
print(count_series_car)

4
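
Several aggregates can also be requested in one call with `.agg()`, which returns a Series indexed by the function names. A short sketch on a copy of the revenue values (the name `summary` is illustrative):

```python
import pandas as pd

revenue = pd.Series([21.7, 40.0, 202.5, 460.8], index=['A', 'B', 'C', 'D'])

# Pass the names of the aggregate functions as a list of strings.
summary = revenue.agg(['sum', 'mean', 'min', 'max', 'count'])
print(summary)

# Individual results are looked up by function name.
print(summary['count'])  # 4.0
print(summary['min'])    # 21.7
```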


# DataFrame
`df = pd.DataFrame()`

where df is a variable holding the DataFrame

In [66]:
series_car

0         Toyota
1          Lexus
2          Honda
3    Range Rover
dtype: object

In [67]:
df_car = pd.DataFrame(series_car, columns = ['car brand'])
df_car

     car brand
0       Toyota
1        Lexus
2        Honda
3  Range Rover


In [68]:
type(df_car)

pandas.core.frame.DataFrame

In [69]:
print(result)

      0   1
A   3.1   7
B   4.0  10
C  13.5  15
D  19.2  24


In [70]:
df_result = pd.DataFrame(result)
df_result

      0   1
A   3.1   7
B   4.0  10
C  13.5  15
D  19.2  24


In [71]:
type(df_result)

pandas.core.frame.DataFrame

In [72]:
# .rename()
df_result.rename(index = {'A':0, 'B':1, 'C':2, 'D':3}, columns = {0:'price', 1:'quantity'}, inplace = True)
df_result

   price  quantity
0    3.1         7
1    4.0        10
2   13.5        15
3   19.2        24
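
With `inplace = True`, `.rename()` changes the existing DataFrame. Without it, the original is left untouched and a renamed copy is returned. A minimal sketch on a small hypothetical frame named `raw`:

```python
import pandas as pd

raw = pd.DataFrame({0: [3.1, 4.0], 1: [7, 10]}, index=['A', 'B'])

# Without inplace, rename() returns a new DataFrame.
renamed = raw.rename(columns={0: 'price', 1: 'quantity'})
print(raw.columns.tolist())      # [0, 1]
print(renamed.columns.tolist())  # ['price', 'quantity']

# reset_index(drop=True) is another way to get a 0, 1, 2, ... index.
renumbered = renamed.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```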


# Other functions and methods
- the aggregate functions, .describe(), .select_dtypes(), etc.

In [73]:
df_result

   price  quantity
0    3.1         7
1    4.0        10
2   13.5        15
3   19.2        24


In [74]:
# shape of the dataframe
df_result.shape

(4, 2)

In [75]:
# the index of the dataframe
df_result.index

Index([0, 1, 2, 3], dtype='int64')

In [76]:
df_result['quantity'].index

Index([0, 1, 2, 3], dtype='int64')

In [77]:
# the values of the dataframe
df_result.values

array([[ 3.1,  7. ],
       [ 4. , 10. ],
       [13.5, 15. ],
       [19.2, 24. ]])

In [78]:
# Indexing and slicing
df_result['price'].values

array([ 3.1,  4. , 13.5, 19.2])

In [79]:
df_result['price'][:2].values

array([3.1, 4. ])

In [80]:
# first three rows of the df_result
df_result[:3]

   price  quantity
0    3.1         7
1    4.0        10
2   13.5        15


In [81]:
df_result['quantity'][:2] # the first two rows of the quantity column

0     7
1    10
Name: quantity, dtype: int64

`N.B.: You can apply aggregate functions to a whole DataFrame; the result is returned as a Series with one value per column.`

In [82]:
# .sum()
df_result.sum()

price       39.8
quantity    56.0
dtype: float64

In [83]:
df_result['price'].sum()

39.8

In [84]:
# .mean()
df_result.mean()

price        9.95
quantity    14.00
dtype: float64

In [85]:
df_result['quantity'].mean()

14.0

### .describe()
- It is an inbuilt method that generates descriptive statistics for a DataFrame or Series, including the count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum.

In [86]:
# view the summary descriptive statistics
df_result.describe()

           price   quantity
count   4.000000   4.000000
mean    9.950000  14.000000
std     7.756503   7.438638
min     3.100000   7.000000
25%     3.775000   9.250000
50%     8.750000  12.500000
75%    14.925000  17.250000
max    19.200000  24.000000


In [87]:
df_result.describe().astype('int') # .astype() converts values to another data type; casting to 'int' drops the decimals (9.95 becomes 9)

       price  quantity
count      4         4
mean       9        14
std        7         7
min        3         7
25%        3         9
50%        8        12
75%       14        17
max       19        24


In [88]:
# you can pass a list of the percentiles you want to see
df_result.describe([.50, .75, .90 ]).astype('int') 

       price  quantity
count      4         4
mean       9        14
std        7         7
min        3         7
50%        8        12
75%       14        17
90%       17        21
max       19        24


For the price variable, there are 4 data points. The mean price is `$9.95`, with a standard deviation of about `$7.76`. The minimum price is `$3.10` and the maximum price is `$19.20`. The 25th percentile is about `$3.78`, the median price is `$8.75`, and the 75th percentile is about `$14.93`. The integer-cast tables show these values with the decimals dropped (for example, 9.95 appears as 9).

For the quantity variable, there are also 4 data points. The mean quantity is 14, with a standard deviation of about 7.44. The minimum quantity is 7 and the maximum quantity is 24. The 25th percentile is 9.25, the median quantity is 12.5, and the 75th percentile is 17.25.

This data shows a wide range of prices, from `$3.10` to `$19.20`. With only four data points (two low prices and two high ones), the prices do not cluster around the mean. The quantity data also shows a wide range, from 7 to 24, with the middle 50% of values falling between 9.25 and 17.25.
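
If the goal is only a tidier table, rounding keeps more information than casting to integers. The sketch below rebuilds the same price and quantity values in a hypothetical frame named `sales` and also computes the spread of the middle 50% of values.

```python
import pandas as pd

sales = pd.DataFrame({'price': [3.1, 4.0, 13.5, 19.2],
                      'quantity': [7, 10, 15, 24]})

# .round(2) keeps two decimals instead of dropping them all.
summary = sales.describe().round(2)
print(summary.loc['mean', 'price'])  # 9.95
print(summary.loc['50%', 'price'])   # 8.75

# Interquartile range per column: 75th percentile minus 25th percentile.
iqr = sales.quantile(0.75) - sales.quantile(0.25)
print(iqr)
```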

# DataFrame loc and iloc
- DataFrame loc

The .loc[] indexer allows us to extract data along the rows and columns using index labels.



In [89]:
# retrieve 24 using loc from the quantity
df_result.loc[3, 'quantity']

24

In [90]:
# retrieve the first three values using loc from the quantity (label slices include the end label, so :2 returns rows 0, 1, and 2)
df_result.loc[:2, 'quantity']

0     7
1    10
2    15
Name: quantity, dtype: int64

- DataFrame iloc

The .iloc[] indexer allows us to access data using integer row and column positions. It is called integer-based (positional) indexing.

In [91]:
df_result.iloc[3, 1]

24
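
In df_result the labels and positions happen to be the same numbers, which hides the difference between the two indexers. A minimal sketch with text labels (the frame name `stock` is an assumption) makes it visible:

```python
import pandas as pd

stock = pd.DataFrame({'price': [3.1, 4.0, 13.5, 19.2],
                      'quantity': [7, 10, 15, 24]},
                     index=['A', 'B', 'C', 'D'])

# Same cell: by labels with .loc, by positions with .iloc.
print(stock.loc['D', 'quantity'])  # 24
print(stock.iloc[3, 1])            # 24

# .loc slices include the end label; .iloc slices exclude the end position.
print(stock.loc['A':'C'].index.tolist())  # ['A', 'B', 'C']
print(stock.iloc[0:2].index.tolist())     # ['A', 'B']
```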

# Building a DataFrame using Dictionary

In [92]:
customer = {'Customer ID': [1, 2, 3, 4, 5],
            'Customer Name': ['Deji Ade', 'Sola Omo', 'Taiwo Scott', 'Bimbo Odu', 'Prince Charm'],
            'Department': ['HR', 'IT', 'Product', 'Customer Care', 'Logistics'],
            'Salary': [120000, 500000, 300000, 200000, 150000],
            'Grade level': ['Executive', 'Senior', 'Mid-level', 'Junior', 'Entry']}

print(customer)

{'Customer ID': [1, 2, 3, 4, 5], 'Customer Name': ['Deji Ade', 'Sola Omo', 'Taiwo Scott', 'Bimbo Odu', 'Prince Charm'], 'Department': ['HR', 'IT', 'Product', 'Customer Care', 'Logistics'], 'Salary': [120000, 500000, 300000, 200000, 150000], 'Grade level': ['Executive', 'Senior', 'Mid-level', 'Junior', 'Entry']}


In [93]:
type(customer)

dict

In [94]:
# pass the customer into a DataFrame
df = pd.DataFrame(customer)
df

   Customer ID Customer Name     Department  Salary Grade level
0            1      Deji Ade             HR  120000   Executive
1            2      Sola Omo             IT  500000      Senior
2            3   Taiwo Scott        Product  300000   Mid-level
3            4     Bimbo Odu  Customer Care  200000      Junior
4            5  Prince Charm      Logistics  150000       Entry
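
A dictionary of lists describes the table column by column. The same table can also be described row by row, as a list of dictionaries. The sketch below uses a shortened, illustrative version of the data to show that both routes give the same DataFrame.

```python
import pandas as pd

# Column by column: each key is a column, each list holds that column's values.
by_cols = pd.DataFrame({'Customer ID': [1, 2],
                        'Department': ['HR', 'IT'],
                        'Salary': [120000, 500000]})

# Row by row: each dictionary is one record.
records = [
    {'Customer ID': 1, 'Department': 'HR', 'Salary': 120000},
    {'Customer ID': 2, 'Department': 'IT', 'Salary': 500000},
]
by_rows = pd.DataFrame(records)

print(by_rows.columns.tolist())  # ['Customer ID', 'Department', 'Salary']
print(by_rows.equals(by_cols))   # True
```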


In [95]:
# to retrieve 'Product' from department
df.iloc[2, 2]

'Product'

In [96]:
# select the rows at positions 1 through 3 (the end position, 4, is excluded)
df.iloc[1:4]

   Customer ID Customer Name     Department  Salary Grade level
1            2      Sola Omo             IT  500000      Senior
2            3   Taiwo Scott        Product  300000   Mid-level
3            4     Bimbo Odu  Customer Care  200000      Junior


In [97]:
# filter rows where 'Salary' is between 200k and 500k (inclusive) # & ==> AND while | ==> OR
filtered_df = df.loc[(df['Salary'] >= 200000) & (df['Salary'] <= 500000)]
print(filtered_df)

   Customer ID Customer Name     Department  Salary Grade level
1            2      Sola Omo             IT  500000      Senior
2            3   Taiwo Scott        Product  300000   Mid-level
3            4     Bimbo Odu  Customer Care  200000      Junior


In [98]:
# filter rows where 'Salary' is 150k OR 'Grade level' is Senior
filtered_df1 = df.loc[(df['Salary'] == 150000) | (df['Grade level'] == 'Senior')]
print(filtered_df1)

   Customer ID Customer Name Department  Salary Grade level
1            2      Sola Omo         IT  500000      Senior
4            5  Prince Charm  Logistics  150000       Entry


In [99]:
# filter rows where 'Salary' is 150k AND 'Grade level' is Senior (no row meets both conditions, so the result is empty)
filtered_df1 = df.loc[(df['Salary'] == 150000) & (df['Grade level'] == 'Senior')]
print(filtered_df1)

Empty DataFrame
Columns: [Customer ID, Customer Name, Department, Salary, Grade level]
Index: []
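
Two Series helpers can shorten common filters: `.between()` for an inclusive range and `.isin()` for membership in a list. The sketch below uses a reduced, illustrative copy of the data named `staff`; `~` negates a condition.

```python
import pandas as pd

staff = pd.DataFrame({'Salary': [120000, 500000, 300000, 200000, 150000],
                      'Grade level': ['Executive', 'Senior', 'Mid-level', 'Junior', 'Entry']})

# between() includes both bounds by default, like >= combined with <=.
mid_band = staff[staff['Salary'].between(200000, 500000)]
print(mid_band.index.tolist())  # [1, 2, 3]

# isin() keeps rows whose value appears in the given list.
chosen = staff[staff['Grade level'].isin(['Senior', 'Entry'])]
print(chosen.index.tolist())  # [1, 4]

# ~ flips a condition: every row that is NOT Senior.
not_senior = staff[~(staff['Grade level'] == 'Senior')]
print(not_senior.index.tolist())  # [0, 2, 3, 4]
```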


In [100]:
# using .iloc
df.iloc[:4]

   Customer ID Customer Name     Department  Salary Grade level
0            1      Deji Ade             HR  120000   Executive
1            2      Sola Omo             IT  500000      Senior
2            3   Taiwo Scott        Product  300000   Mid-level
3            4     Bimbo Odu  Customer Care  200000      Junior


In [105]:
df.iloc[[2, 4]]

   Customer ID Customer Name Department  Salary Grade level
2            3   Taiwo Scott    Product  300000   Mid-level
4            5  Prince Charm  Logistics  150000       Entry


# Fetching Categorical and Numerical Columns
- We use the .select_dtypes() method to fetch either the categorical columns or the numerical columns.

In [102]:
# to fetch categorical columns from our DataFrame, df
cat_vars = df.select_dtypes(include = ['object'])
print(cat_vars)

  Customer Name     Department Grade level
0      Deji Ade             HR   Executive
1      Sola Omo             IT      Senior
2   Taiwo Scott        Product   Mid-level
3     Bimbo Odu  Customer Care      Junior
4  Prince Charm      Logistics       Entry


In [113]:
# to extract only the column names, we use .columns, and to get them as a list, we use .tolist()
cat_vars.columns

Index(['Customer Name', 'Department', 'Grade level'], dtype='object')

In [114]:
cat_vars.columns.tolist()

['Customer Name', 'Department', 'Grade level']

In [103]:
# to fetch numerical columns from our DataFrame, df
num_vars = df.select_dtypes(include = ['number'])
print(num_vars)

   Customer ID  Salary
0            1  120000
1            2  500000
2            3  300000
3            4  200000
4            5  150000
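
`.select_dtypes()` also accepts an `exclude` argument, which is handy for asking for everything that is not numeric. A minimal sketch on a small hypothetical frame named `people`:

```python
import pandas as pd

people = pd.DataFrame({'name': ['Deji', 'Sola'],
                       'dept': ['HR', 'IT'],
                       'salary': [120000, 500000]})

# include keeps the matching columns; exclude drops them.
numeric = people.select_dtypes(include=['number'])
non_numeric = people.select_dtypes(exclude=['number'])

print(numeric.columns.tolist())      # ['salary']
print(non_numeric.columns.tolist())  # ['name', 'dept']
```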


## .unique() and .nunique()

In [108]:
# .unique(): it is an inbuilt method that returns the distinct values in a column.
df['Department'].unique()

array(['HR', 'IT', 'Product', 'Customer Care', 'Logistics'], dtype=object)

In [112]:
# to convert the result to a list, we use .tolist()
df['Department'].unique().tolist()

['HR', 'IT', 'Product', 'Customer Care', 'Logistics']

In [116]:
# .nunique(): it is an inbuilt method that returns the number or count of distinct values.
df.nunique()

Customer ID      5
Customer Name    5
Department       5
Salary           5
Grade level      5
dtype: int64

In [117]:
df['Department'].nunique()

5
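
In df every value is distinct, so the counts all equal the number of rows. The sketch below uses a hypothetical column with repeats to show the difference, and adds `.value_counts()`, which reports how often each distinct value occurs.

```python
import pandas as pd

departments = pd.Series(['HR', 'IT', 'IT', 'Product', 'HR', 'IT'])

# Distinct values, in order of first appearance.
print(departments.unique().tolist())  # ['HR', 'IT', 'Product']

# Number of distinct values.
print(departments.nunique())  # 3

# Frequency of each value, most frequent first.
counts = departments.value_counts()
print(counts['IT'])     # 3
print(counts.index[0])  # IT
```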