## Python Toolbox

<ul>
  <li>Defining a function</li>
  <li>Function parameters </li>
  <li>Docstrings</li>
    <ul>
      <li>Docstrings describe what your function does</li>
      <li>In between triple double quotes """</li>
    </ul>
  <li>Returning multiple values with <b>Tuples</b></li>
    <ul>
      <li>Immutable - can’t modify values!</li>
      <li>Constructed using parentheses()</li>
    </ul>
  <li>Scope: global</li>
  <li>Nested functions</li>
  <li>Using nonlocal</li>
    <ul>
      <li> The <b>nonlocal</b> keyword lets a nested (inner) function rebind a variable that belongs to the enclosing function instead of creating a new local variable.
      </li>
    </ul>
  <li>Add a default argument</li>
  <li>Flexible arguments: *args</li>
    <ul>
      <li> <b>Note:</b> args is just a name. You’re not required to use it; any valid name works after the asterisk.
      </li>
    </ul>
  <li>Flexible arguments: **kwargs</li>
    <ul>
      <li> Works just like *args, but accepts keyword (named) arguments instead of positional arguments.
      </li>
    </ul>
  <li>Lambda functions <b>(Anonymous functions)</b></li>
    <ul>
      <li> Syntax: lambda arguments : expression</li>
    </ul>
  <li>map Function</li>
    <ul>
      <li> takes two arguments: map(func, seq)</li>
    </ul>
  <li>Errors and exceptions</li>
    <ul>
      <li> raise</li>
      <li> try</li>
      <li> except</li>
      <li> finally (optional)</li>
    </ul>
    
  
</ul>
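
The outline lists `raise` next to `try`, `except` and `finally`. A minimal sketch of how the four keywords can fit together is given here; `safe_divide` and its return format are hypothetical names chosen only for illustration.

```python
def safe_divide(numerator, denominator):
    """Return (result, log); result is None if the division is rejected."""
    log = []  # records the steps taken, only to make the control flow visible
    try:
        if denominator == 0:
            # raise signals the problem explicitly with a chosen exception type
            raise ValueError("denominator must not be zero")
        result = numerator / denominator
    except ValueError as error:
        # runs only when the try block raised a ValueError
        log.append(f"error: {error}")
        result = None
    finally:
        # runs whether or not an exception was raised
        log.append("done")
    return result, log


value, steps = safe_divide(6, 3)
print(value, steps)

value_bad, steps_bad = safe_divide(1, 0)
print(value_bad, steps_bad)
```

Catching a specific exception type, as done here, keeps unrelated errors visible instead of hiding them.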






In [None]:
# Defining a function
def add(): # <- Function header
  sum = 2 + 2 # <- Function body
  return sum

add()

# Function parameters
def add(para): # <- Function header
  sum = para + 2 # <- Function body
  return sum

add(3) # Pass a concrete value; returns 5

# Docstrings
def add(para): # <- Function header
  """Return the sum of para and 2"""
  sum = para + 2 # <- Function body
  return sum

# Returning multiple values with Tuples
def add(para1, para2): # <- Function header
  sum1 = para1 + para2 # <- Function body
  sum2 = para1 + 2 * para2 
  new_tuple = (sum1, sum2)
  return new_tuple

add(1, 2) # Pass concrete values; returns the tuple (3, 5)

# Scope: global
x = 300

def myfunc():
  x = 200
  print(x) # Output 200

myfunc()

print(x) # Output 300
# -----------------------
x = 300
def myfunc():
  global x
  x = 200

myfunc()

print(x) # Output 200

# Nested function
def outer(msg):
    # This is the outer enclosing function

    def inner():
        # This is the nested function
        print(msg)

    inner()

outer("Hello") # Output: Hello

# Using nonlocal 
def outer():
  """Prints the value of n."""    
  n = 1

  def inner():
    nonlocal n
    n = 2        
    print(n) # Output 2
    
  inner()
  print(n) # Output 2

outer()

# Add a default argument
def greet(name, msg="Good morning!"):
    """
    This function greets
    the person with the
    provided message.

    If the message is not provided,
    it defaults to "Good
    morning!"
    """

    print("Hello", name + ', ' + msg)


greet("Kate") # Output: Hello Kate, Good morning!
greet("Bruce", "How do you do?") # Output: Hello Bruce, How do you do?

# Flexible arguments: *args
def my_sum(*args):
    result = 0
    # Iterating over the Python args tuple
    for x in args:
        result += x
    return result

print(my_sum(1, 2, 3))

# Flexible arguments: **kwargs
def concatenate(**kwargs):
    result = ""
    # Iterating over the Python kwargs dictionary
    for arg in kwargs.values():
        result += arg
    return result

print(concatenate(a="Real", b="Python", c="Is", d="Great", e="!"))

# Lambda functions 
(lambda x: x + 1)(2) # Output: 3
# -------
raise_to_power = lambda x, y: x ** y
raise_to_power(2, 3) # Output: 8

# map Function
# Double all numbers using map and lambda
numbers = (1, 2, 3, 4)
result = map(lambda x: x + x, numbers)
print(list(result)) # Output: [2, 4, 6, 8]

# Errors and exceptions
try:
    print("code start")

    print(1 / 0) # Put the unsafe operation in the try block

# If an error occurs, execution jumps to the except block
except ZeroDivisionError:
    print("an error occurs")

# The finally block always runs, with or without an error
finally:
    print("Hello world")



## Import Packages and Libraries
This is the lineup of the most important Python libraries for data analytics:

### Data Processing and Modeling:

*   NumPy
*   Pandas

### Data Visualization

*   Matplotlib
*   Seaborn

**Note**❗: Import only the packages which will be used!

In [None]:
import numpy as np # np is the conventional alias for NumPy
import pandas as pd # pd is the conventional alias for pandas

import matplotlib.pyplot as plt # plt is the conventional alias for matplotlib.pyplot
import seaborn as sns # sns is the conventional alias for Seaborn

## Read the Dataset
How you import a dataset depends on the format of the file (Excel, CSV, text, SPSS, Stata, etc.). These are the most common ways to read a dataset:

**Notes**❗: 

*   CSV stands for comma-separated values.
*   Pay attention when writing the path of the dataset: a wrong path is a common cause of `FileNotFoundError`.
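
A small sketch of reading CSV data, assuming the content is held in memory so that no file is needed. The CSV text and column names are made up for illustration; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "name,age\nAda,36\nGrace,45\n"

# read_csv accepts a path or any file-like object; the first row becomes the header
people = pd.read_csv(io.StringIO(csv_text))

print(people)
print(people.dtypes)  # age is parsed as an integer column
```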







In [None]:
dataFrame = pd.read_csv('path/file.csv')
dataFrame = pd.read_excel('path/file.xlsx')

## Inspecting the Data Frame

The main intentions of inspecting our data are:

*   To have an idea of the size of the dataset.
*   To get the data type of each variable in the dataset.
*   To identify whether there are missing values in the dataset.

There are other reasons as well.

**Note**❗: `shape` is an attribute (no parentheses), whereas `head()`, `info()` and `describe()` are methods.
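
A minimal sketch of the attribute/method distinction, using a tiny made-up data frame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with one missing value
cities = pd.DataFrame(
    {"city": ["Berlin", "Paris", None], "population": [3.6, 2.1, 0.5]}
)

size = cities.shape             # attribute: accessed without parentheses
first_rows = cities.head(2)     # method: called with parentheses
missing = cities.isna().sum()   # number of missing values per column

print(size)
print(first_rows)
print(missing)
```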


In [None]:
# Print the head of the Data Frame
print(dataFrame.head())

# Print information about Data Frame
dataFrame.info() # info() prints its summary itself and returns None

# Print the shape of Data Frame
print(dataFrame.shape)

# Print a description of Data Frame
print(dataFrame.describe())

dataFrame.shape # Attribute
dataFrame.head() # Method

## Sort the Rows in the Data Frame

The pandas `sort_values()` method sorts a data frame in ascending or descending order by the column (or columns) passed to it.

**Note**❗: `sort_values()` sorts in ascending order by default.
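
A short sketch of sorting by two columns in different directions, on made-up data (the `team` and `score` columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical scores for two teams
scores = pd.DataFrame({"team": ["b", "a", "b", "a"], "score": [10, 30, 20, 40]})

# Ascending by team, then descending by score within each team
ordered = scores.sort_values(["team", "score"], ascending=[True, False])

print(ordered)
```

`sort_values()` returns a new data frame; `scores` itself keeps its original row order.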

In [None]:
# One column
dataFrame.sort_values("col 1", ascending=True)

# Multiple columns
dataFrame.sort_values(["col 1", "col 2"], ascending=[True, False])

## Explicit Indexes

*   Columns and Index
*   Setting a column as the Index
*   Subsetting with Index
*   Sort Index
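
The four steps can be sketched on a tiny made-up data frame. The column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data
dogs = pd.DataFrame(
    {
        "name": ["Rex", "Bella", "Max"],
        "breed": ["Lab", "Poodle", "Lab"],
        "age": [3, 5, 2],
    }
)

# Setting a column as the index
by_name = dogs.set_index("name")

# Subsetting with the index: .loc looks up index labels
subset = by_name.loc[["Rex", "Max"]]

# Two-level index, sorted ascending by breed and descending by name
multi = dogs.set_index(["breed", "name"]).sort_index(
    level=["breed", "name"], ascending=[True, False]
)

print(subset)
print(multi)
```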


In [None]:
# Columns and Index
dataFrame.columns
dataFrame.index

# Setting a column as the Index
dataFrame_ind = dataFrame.set_index("col")

# Subsetting with Index
dataFrame[dataFrame["col"].isin(["row 1", "row 2"])] # First way: filter on the column values
dataFrame_ind.loc[["row 1", "row 2"]] # Second way: look up labels in the index set above

# Sort Index
dataFrame_index = dataFrame.set_index(["col 1", "col 2"])
dataFrame_index.sort_index(level=["col 1", "col 2"], ascending=[True, False])

## Slicing and Subsetting

*   Slicing columns
*   Slicing by dates
*   Subsetting by row/column number
*   Subsetting with conditions
*   Subsetting rows by categorical variables
*   Adding a new column
*   Writing to CSV-File

**Notes**❗:

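*   Sort the data frame's index before slicing with `.loc`.
*   **i**loc: i for integer
*   Logical operators in pandas are `&`, `|` and `~`.
*   The parentheses (...) around each condition are **important**❗ because `&` and `|` bind more tightly than comparison operators such as `<` and `==`.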
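
A minimal sketch of the logical operators and of `iloc`, on made-up data (the `price` and `color` columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data
items = pd.DataFrame(
    {"price": [500, 1500, 800, 2000], "color": ["red", "red", "blue", "blue"]}
)

# Each condition is wrapped in parentheses before combining
cheap_red = items[(items["price"] < 1000) & (items["color"] == "red")]
cheap_or_blue = items[(items["price"] < 1000) | (items["color"] == "blue")]
not_red = items[~(items["color"] == "red")]

# iloc selects by integer position: first two rows, first column
corner = items.iloc[0:2, 0:1]

print(cheap_red)
print(cheap_or_blue)
print(not_red)
print(corner)
```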










In [None]:
# Slicing columns
dataFrame_sorted.loc[:, "col_index 1":"col_index 2"]

# Slicing by date
dataFrame_sorted.loc["yyyy-mm-dd":"yyyy-mm-dd"]

# Subsetting by row/column number
dataFrame.iloc[3:6, 1:5]

# Subsetting with conditions
SubSet = dataFrame[(dataFrame['col 1'] < 1000) & (dataFrame['col 2'] == 'value in col 2')]

# Subsetting rows by categorical variables
listName = ["value 1", "value 2", "value 3", "value 4"] # List of values in a column
SubSet = dataFrame[dataFrame["col"].isin(listName)]

# Adding a new column
dataFrame["new column"] = dataFrame["col 1 with integer values"] / dataFrame["col 2 with integer values"] 

# Writing to CSV-File
dataFrame.to_csv("dataFrame_with_new_column.csv")

## Visualizing the Data



*   Matplotlib
*   Putting two or more data frames in one plot
*   Adding markers
*   Setting the line style
*   Choosing color
*   Customizing the axes labels
*   Adding a title
*   Small multiples with plt.subplots
*   Zooming in on a decade
*   Zooming in on one year
*   Using twin axes
*   Coloring the ticks
*   A function that plots time-series
*   Adding arrows to annotation
*   Rotate the tick labels
*   Adding a legend
*   Adding error bars to bar charts
*   Choosing a style
*   Figure Size
*   Using Matplotlib for geospatial data
*   Seaborn example gallery
*   Histograms
*   Barplots
*   Lineplots
*   Scatterplots
*   Boxplot **(Comparison)**
*   Pairplot
*   Saving the figure to file

**Notes**❗:

* Adjust the number of bars, or bins, using the **"bins"** argument. Increasing or decreasing this can give us a better idea of what the distribution looks like.

*  Pairplot pairs every variable in the data set with every other one and gives an overview of how they relate to each other.
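
A rough sketch of the effect of the `bins` argument, drawing the same made-up values with two different bin counts. The `Agg` backend is chosen here only so that the sketch runs without a display; the data and the `height` column are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window is opened

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements
people = pd.DataFrame(
    {"height": [150, 152, 160, 161, 165, 170, 171, 175, 180, 190]}
)

fig, axes = plt.subplots(1, 2)

# ax.hist returns the bar heights, the bin edges, and the drawn patches
counts_coarse, edges_coarse, _ = axes[0].hist(people["height"], bins=2)
counts_fine, edges_fine, _ = axes[1].hist(people["height"], bins=8)

axes[0].set_title("bins=2")
axes[1].set_title("bins=8")

print(counts_coarse)
print(counts_fine)
```

With few bins the distribution looks flat; more bins reveal where the values cluster, at the cost of noisier bars.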



In [None]:
# Matplotlib
fig, ax = plt.subplots()
ax.plot(dataFrame["colName"], dataFrame["colName"])

# Putting two or more data frames in one plot
ax.plot(dataFrame1["colName"], dataFrame1["colName"])
ax.plot(dataFrame2["colName"], dataFrame2["colName"])

# Adding markers
ax.plot(dataFrame1["colName"], dataFrame1["colName"], marker="o")
ax.plot(dataFrame2["colName"], dataFrame2["colName"], marker="v")

# Setting the line style
ax.plot(dataFrame["colName"], dataFrame["colName"], marker="v", linestyle="--")
ax.plot(dataFrame["colName"], dataFrame["colName"], marker="v", linestyle="None")

# Choosing color 
ax.plot(dataFrame["colName"], dataFrame["colName"], marker="v", linestyle="--", color="r") # r for red and b for blue

# Customizing the axes labels
ax.set_xlabel("nameXLabel")
ax.set_ylabel("nameYLabel")

# Adding a title
ax.set_title("nameTitle")

# Small multiples with plt.subplots
fig, ax = plt.subplots(3, 2) # For example 3 rows and 2 columns

ax[0, 0].plot(dataFrame["colName"], dataFrame["colName"], marker="v", linestyle="--", color="r") # ax[0, 0] The first row and first column
ax[0, 1].plot(dataFrame["colName"], dataFrame["colName"], marker="v", linestyle="--", color="r") # ax[0, 1] The first row and second column

# Zooming in on a decade
seventies = dataFrame["1970-01-01":"1979-12-31"] # Example

ax.plot(seventies.index, seventies['colName'])

# Zooming in on one year
fifty_nine = dataFrame["1959-01-01":"1959-12-31"]
ax.plot(fifty_nine.index, fifty_nine['colName'])

# Using twin axes
ax.plot(dataFrame.index, dataFrame['colName1'])
ax2 = ax.twinx()
ax2.plot(dataFrame.index, dataFrame["colName2"])

# Coloring the ticks
ax.tick_params('y', colors='blue')
ax2.tick_params('y', colors='red')

# A function that plots time-series
def plot_timeseries(axes, x, y, color, xlabel, ylabel):  
  axes.plot(x, y, color=color)  
  axes.set_xlabel(xlabel)  
  axes.set_ylabel(ylabel, color=color)  
  axes.tick_params('y', colors=color)

# Adding arrows to annotation
ax2.annotate(">1 degree", xy=(pd.Timestamp('2017-10-05'), 1), xytext=(pd.Timestamp('2007-09-04'), -0.2), arrowprops={"arrowstyle":"->", "color":"gray"})

# Rotate the tick labels
ax.set_xticklabels(dataFrame.index, rotation=90)

# Adding a legend
ax.legend()

# Adding error bars to bar charts
ax.bar("nameLabel", dataFrame1["colName"].mean(), yerr=dataFrame1["colName"].std())
ax.bar("nameLabel", dataFrame2["colName"].mean(), yerr=dataFrame2["colName"].std())

# Choosing a style
plt.style.use("ggplot")
plt.style.use("bmh")
plt.style.use("seaborn-colorblind") # Named "seaborn-v0_8-colorblind" since Matplotlib 3.6

# Figure Size
fig.set_size_inches([6, 4]) # Example: 6 inches wide, 4 inches high

# Using Matplotlib for geospatial data
# https://scitools.org.uk/cartopy/docs/latest/

# Seaborn example gallery
# Pandas + Matplotlib = Seaborn
# https://seaborn.pydata.org/examples/index.html


# Histograms
dataFrame["col"].hist()
dataFrame["col"].hist(bins=30)

dataFrame[dataFrame["col 1"]=="value 1 in col 1"]["col 2"].hist(alpha=0.7)
dataFrame[dataFrame["col 1"]=="value 2 in col 1"]["col 2"].hist(alpha=0.7)
plt.legend(["value 1 in col 1", "value 2 in col 1"])


# Barplots
dataFrame.plot(kind="bar", title="Title of the Plot")

# Lineplots
dataFrame.plot(x="col 1", y="col 2", kind="line")

# Scatterplots
dataFrame.plot(x="col 1", y="col 2", kind="scatter")

sns.scatterplot(x='col 1', 
                y='col 2',
                hue='col 3', # This will be plotted in different colors.
                data=dataFrame)

plt.show() # To show the Plot

# Boxplot
sns.boxplot(x='col 1', 
            y='col 2',
            hue='col 3', # This will be plotted in different colors.
            data=dataFrame)

# Pairplot
sns.pairplot(dataFrame, hue='col')

# Saving the figure to file
fig.savefig("nameImage.png", dpi=300) # quality only ever affected JPEG output and is no longer accepted by recent Matplotlib versions

## Data Visualization using Seaborn

> Works well with `pandas` data structures

> Built on top of `matplotlib`


---
**Difference between scatter plots, line plots, count/bar plots and box plots:**

*   Scatter plots: Each plot point is an independent observation
*   Line plots: Each plot point represents the same "thing", typically tracked over time
*   Count/Bar plots: Comparisons between groups
*   Box plot: Shows the distribution of quantitative data (median, spread, skewness and outliers) and facilitates comparisons between groups
---

*   Scatter plot
*   Count plot
*   Relational plot
*   Subplots in columns
*   Subplots in rows
*   Subplots in rows and columns
*   Ordering columns
*   Subgroups with point size
*   Line plot
*   countplot() vs. catplot()
*   Boxplot
*   Figure style
*   Changing the palette: the figure "palette" changes the color of the main elements of the plot
*   Changing the scale of the plot elements and labels




In [None]:
hue_colors = {"value1": "black", "value2": "red"} # For example black and red

# Scatter plot
sns.scatterplot(x="colName", y="colName", data=dataFrame, hue="colName", hue_order=["value1", "value2"], palette=hue_colors)

# Count plot
sns.countplot(x="colName", data=dataFrame, hue="colName")

# Relational plot
sns.relplot(x="colName", y="colName", data=dataFrame, kind="scatter")

# Subplots in columns
sns.relplot(x="colName", y="colName", data=dataFrame, kind="scatter", col="namePlot")

# Subplots in rows
sns.relplot(x="colName", y="colName", data=dataFrame, kind="scatter", row="namePlot")

# Subplots in rows and columns
sns.relplot(x="colName", y="colName", data=dataFrame, kind="scatter", col="namePlot", row="namePlot")

# Ordering columns
sns.relplot(x="colName", y="colName", data=dataFrame, kind="scatter", col="namePlot", col_order=["", "", ""])

# Subgroups with point size
sns.relplot(x="colName", y="colName", data=dataFrame, kind="scatter", size="size", hue="size")

# Line plot
sns.relplot(x="colName1", y="colName2", data=dataFrame, kind="line", style="colName3", hue="colName3", markers=True, dashes=False, ci="sd") # "sd" draws the interval as the standard deviation; seaborn 0.12+ uses errorbar="sd" instead of ci

# countplot() vs. catplot()
sns.countplot(x="colName1", data=dataFrame)

category_order = ["value 1", "value 2"] # Placeholder category values in the desired order
sns.catplot(x="colName1", data=dataFrame, kind="count", order=category_order)

# Boxplot
g = sns.catplot(x="colName1", y="colName2", data=dataFrame, kind="box", order=["", ""], sym="") # Omitting the outliers using `sym`

# Figure style
sns.set_style("whitegrid") # "white", "dark", "whitegrid", "darkgrid", "ticks"

# Changing the palette
sns.set_palette("RdBu")

# Changing the scale of the plot elements and labels
sns.set_context("talk") #"paper","notebook","talk","poster"

## Missing Values

*   Detecting missing values
*   Detecting any missing values
*   Counting missing values
*   Removing missing values
*   Replacing missing values
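
A compact sketch of the five steps on a made-up data frame (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values
measurements = pd.DataFrame(
    {"weight": [10.0, np.nan, 30.0], "height": [1.0, 2.0, np.nan]}
)

has_missing = measurements.isna().any()          # True per column if any value is missing
missing_per_column = measurements.isna().sum()   # count per column
complete_rows = measurements.dropna()            # drops every row with a missing value
filled = measurements.fillna(0)                  # replaces missing values with 0

print(has_missing)
print(missing_per_column)
print(complete_rows)
print(filled)
```

Both `dropna()` and `fillna()` return new data frames and leave `measurements` unchanged.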



In [None]:
# Detecting missing values
dataFrame.isna() # It returns True or False

# Detecting any missing values
dataFrame.isna().any()

# Counting missing values
dataFrame.isna().sum()

# Removing missing values
dataFrame.dropna()

# Replacing missing values
dataFrame.fillna(0)

## Important Definitions

### KPI: 

> KPI (Key Performance Indicator) is a type of performance measurement. KPIs evaluate the success of an organization or of a particular activity (such as projects, programs, products and other initiatives) in which it engages.

Depending on the project you are working on, a KPI could be:
*  number of new customers per month
*  sum of the orders value per day
*  Net Promoter Score (NPS)

And many more...

### Model Bucket:

*   Change over time
*   **Comparison**
*   **Part of a whole**
*   **A Correlation**
*   **Ranking**
*   **Distribution**
*   Flows and relationships
*   Geospatial

### Numerical Data

*   Cumulative sum
*   Sum
*   Median
*   Minimum
*   Maximum
*   Standard deviation
*   Quantile
*   Mode
*   Count Values
*   Variance
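
A small sketch of several of these statistics on a made-up series of five numbers, including the cumulative sum. The values are chosen only so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical values
values = pd.Series([2, 4, 4, 6, 9])

running_total = values.cumsum()         # cumulative sum
median = values.median()
mode = values.mode()                    # returns a Series, since ties are possible
first_quartile = values.quantile(0.25)
variance = values.var()                 # sample variance (ddof=1)

print(running_total.tolist())
print(median, mode.tolist(), first_quartile, variance)
```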

### APIs and Data Structures

> API is the acronym for Application Programming Interface, which is a software intermediary that allows two applications to talk to each other. Each time you use an app like Facebook, send an instant message, or check the weather on your phone, you're using an API.


*   What kind of data can you get from the API?
*   Does the data contain Lists? What is the content of these Lists?
*   Does the data contain Dictionaries? What is the content of the Dictionaries?
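
One way to answer these questions is to walk through a response and report where lists and dictionaries occur. The sketch below assumes a JSON response; the payload and the helper `describe` are hypothetical, and a real API will return a different structure.

```python
import json

# Hypothetical API response; real responses depend on the service
payload = (
    '{"city": "Berlin", "forecast": '
    '[{"day": "Mon", "temp": 21}, {"day": "Tue", "temp": 19}]}'
)

weather = json.loads(payload)  # JSON objects become dicts, arrays become lists


def describe(value, name="root", depth=0):
    """Return lines describing nested lists and dictionaries."""
    lines = []
    indent = "  " * depth
    if isinstance(value, dict):
        lines.append(f"{indent}{name}: dict with keys {list(value)}")
        for key, item in value.items():
            lines.extend(describe(item, key, depth + 1))
    elif isinstance(value, list):
        lines.append(f"{indent}{name}: list of {len(value)} items")
        if value:
            # Only the first element is inspected, assuming the rest look alike
            lines.extend(describe(value[0], f"{name}[0]", depth + 1))
    else:
        lines.append(f"{indent}{name}: {type(value).__name__}")
    return lines


structure = describe(weather)
print("\n".join(structure))

# Once the structure is known, values can be pulled out directly
temps = [entry["temp"] for entry in weather["forecast"]]
print(temps)
```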

### Business Intelligence tools

*  **Tableau**
*  **Looker**
*  Metabase
*  Periscope
*  Plotly Dash
*  Datapine
*  Zoho Analytics
*  **PowerBI**



In [None]:
# Median
dataFrame["col"].median()

# Mode
dataFrame.mode()

# Minimum
dataFrame["col"].min()

# Maximum
dataFrame["col"].max()

# Variance
dataFrame.var() # Unbiased (sample) variance, ddof=1 by default

# Standard deviation
dataFrame["col"].std()

# Sum
dataFrame.sum()

# Quantile
dataFrame.quantile() # Returns the 0.5 quantile (the median) by default
 
# Count Values
dataFrame["col"].value_counts(sort=True) # sort is True by default, so the most frequent value comes first

## Grouped Summary Statistics

*   Grouped summaries
*   Multiple grouped summaries
*   Grouping by multiple variables
*   Many groups, many summaries
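
A minimal sketch of grouped summaries on made-up sales data (the `store`, `item` and `sales` columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame(
    {
        "store": ["A", "A", "B", "B"],
        "item": ["x", "y", "x", "x"],
        "sales": [10, 20, 30, 50],
    }
)

# One summary per group
mean_by_store = sales.groupby("store")["sales"].mean()

# Several summaries per group, named by string
summary = sales.groupby("store")["sales"].agg(["min", "max", "sum"])

# Grouping by two variables gives one row per (store, item) pair that occurs
by_store_item = sales.groupby(["store", "item"])["sales"].mean()

print(mean_by_store)
print(summary)
print(by_store_item)
```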





In [None]:
# Grouped summaries
dataFrame.groupby("col 1")["col 2"].mean() # min(), max() or sum can be used instead of mean()

# Multiple grouped summaries
dataFrame.groupby("col 1")["col 2"].agg(["min", "max", "sum"]) # Function names as strings avoid warnings in recent pandas versions

# Grouping by multiple variables
dataFrame.groupby(["col 1", "col 2"])["col 3"].mean()

# Many groups, many summaries
dataFrame.groupby(["col 1", "col 2"])[["col 3", "col 4"]].mean()

## Selecting Data with Query

*   The `.query()` method
*   Querying on a single condition
*   Querying on multiple conditions with "and" and "or"
*   Using `.query()` to select text

**Note**❗: The `.query()` method accepts an input string:
 

*   Input string used to determine what rows are returned
*   Input string similar to statement after **WHERE** clause in **SQL** statement
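
A short sketch of `.query()` on made-up data. The column names are assumptions; `unit price` contains a space on purpose, to show that such names have to be wrapped in backticks inside the query string.

```python
import pandas as pd

# Hypothetical product data
products = pd.DataFrame(
    {"unit price": [5, 15, 25], "stock": [100, 0, 40], "label": ["a", "b", "a"]}
)

# Single condition; backticks quote a column name that contains a space
expensive = products.query("`unit price` >= 10")

# Multiple conditions combined with "and"
available = products.query("`unit price` > 10 and stock > 0")

# Selecting text: string values go in quotes inside the query string
by_label = products.query('label == "a" or (label == "b" and stock < 10)')

print(expensive)
print(available)
print(by_label)
```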




In [None]:
# The .query() Method
dataFrame.query('SOME SELECTION STATEMENT')

# Querying on a single condition
dataFrame.query('col >= 10') # 10 is a placeholder integer

# Querying on multiple conditions, "and", "or"
dataFrame.query('`col 1` > 10 and `col 2` < 20') # Backticks are needed for column names containing spaces
dataFrame.query('`col 1` > 10 or `col 2` < 20')

# Using .query() to select text
dataFrame.query('`col 1` == "value(string)" or (`col 1` == "value(string)" and `col 2` < 20)')