Here are some definitions of the terminology we will use in the workshop.
* Jupyter notebook or simply notebook
* Python language
    * Common commands
* Packages


Jupyter Notebook

* Notebooks are used to run Python commands interactively
* Programmers find this interactivity useful because each cell can be run and its output inspected right away
* Notebooks are one of the most common tools used by Python programmers


Python Language

* Python belongs to a class of programming languages called interpreted languages
* The code is not compiled in a separate step before it runs
* Code is written and executed directly
* Python remains one of the most popular programming languages among developers
* Python is also used to build websites and software applications.

* Example:
```
# code assigns value 5 to a variable `x`
x=5
# print the value of the variable `x`
print(x)
```
gives the result
```
5
```


Python Packages

* Python is an extensive language with much functionality built in.
* Some features are provided by packages, which must be imported before use
* An example is the dataframe functionality provided by pandas

```
import pandas as pd
```
* The above command imports a dataframe manipulation package called pandas and gives it the short alias `pd`



Python packages that we will use in this course are:
* pandas: A dataframe manipulation package
* plotly: Interactive plotting package
* numpy: Numeric manipulation package


*Python under the hood*

* Variables
* Data types
* Loops
* Conditional statements
* Tables and common operations on them
    * Merge, sort, group
* Functions

**Variables**

Variables are used to store data in memory (RAM)
* Can store a single value, series of values, tables, objects etc.
* A variable is created when a value is assigned to it.
* Using meaningful names makes the code easier to understand.



In [None]:
### Example variables
# set variable x to 20 and print its value
x = 20
print(x)

# set variable x to string "It is data analytics class" and print its value
x = 'It is data analytics class'
print(x)



20
It is data analytics class
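
Beyond single values, a variable can also hold a series of values or a small object. The cell below is a minimal sketch of that idea; the names `course_name`, `student_ages` and `student_record` are illustrative assumptions and not part of the workshop data.

```python
# A variable is created the moment a value is assigned to it
course_name = "Data analytics"                 # a single text value
student_ages = [20, 21, 20]                    # a series of values (a list)
student_record = {"name": "John", "age": 20}   # a small key/value object (a dictionary)

# Meaningful names make the intent clear without extra comments
number_of_students = len(student_ages)
print(course_name, number_of_students, student_record["name"])
```

Compare `number_of_students` with a name such as `n` or `x`: the longer name tells the reader what the value means.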


**Datatypes**

* Data types indicate whether a value is a string, float, integer or another kind of object
* Data types are inferred by the Python interpreter from the value that is assigned

In [None]:
# Example data
# Set variable x to 30 and print its data type
x = 30
print(type(x)) # the variable is of type integer

# Set variable `textvar` to "It is data analysis class" and print its data type
textvar = "It is data analysis class"
print(type(textvar)) # the variable is of type string

# Set variable `mylist` to [0,20,30,50] and print its data type
mylist = [0,20,30,50]
print(type(mylist)) # the variable is of type list

# Convert `mylist` to a NumPy array named `arr` and print its data type
import numpy as np
arr = np.array(mylist)
print(type(arr)) # the variable is of type NumPy array (ndarray)


<class 'int'>
<class 'str'>
<class 'list'>
<class 'numpy.ndarray'>


The data types printed above are integer, string, list and array, respectively.

**NOTE: DO NOT NAME YOUR VARIABLES WITH THE DATATYPE SUCH AS list, int, str etc.**
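
The note above can be illustrated with a short sketch. It deliberately reuses the built-in name `list` as a variable to show what goes wrong, then restores the built-in through the standard `builtins` module. The names `numbers` and `values_list` are hypothetical.

```python
import builtins

numbers = (1, 2, 3)

list = [10, 20]              # BAD: `list` now refers to this data, not the built-in type
try:
    list(numbers)            # fails because a list object cannot be called
except TypeError as err:
    print("Shadowing broke list():", err)

list = builtins.list         # restore the built-in so later code works again
values_list = list(numbers)  # GOOD: a descriptive name leaves the built-in untouched
print(values_list)
```

In a notebook the shadowed name stays in effect for every later cell until it is restored or the kernel is restarted, which makes this kind of bug hard to spot.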

**Loops**
* Loops are used to automate repeated tasks over a series of values
* Example series are arrays, lists and rows of tables


In [None]:
# Loop or iterate over a list and print the values
mylist = [0,20,30,50] # define the list
for val in mylist: # loop over it
    print(val)

# This is also called a for loop because it uses the command `for`
# Exercise: Try to loop over an array.

0
20
30
50
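
Loops are also useful for computing something across the rows of a table. The sketch below uses a plain list of dictionaries to stand in for table rows (an assumption made to keep the example free of extra packages) and accumulates a running total; `rows`, `total_age` and `average_age` are illustrative names.

```python
# Each dictionary stands in for one row of a table (illustrative data)
rows = [
    {"name": "John", "age": 20},
    {"name": "Joe", "age": 21},
    {"name": "Emma", "age": 20},
]

total_age = 0
for position, row in enumerate(rows):   # enumerate also gives the position of each row
    print(position, row["name"], row["age"])
    total_age += row["age"]             # accumulate a running total

average_age = total_age / len(rows)
print("Average age:", average_age)
```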


**Conditional Statements**

* These statements are used to select a subset of data that meets a particular condition
* For example, select all data that belongs to 2024
* A conditional statement consists of a keyword that marks it as a condition, followed by the condition itself
* Conditional keywords are `if`, `elif`, `else`
* Conditions are built with comparison operators such as `>`, `<`, `==` etc.

In [None]:
# From the list, mylist = [0,20,30,50], select values that are less than 30 and write them to a new list
filtered_list = [] # create an empty list
for val in mylist: # loop through the list
    if val < 30: # select values less than 30
        filtered_list.append(val) # append values to the list if condition is met
filtered_list # show the values

[0, 20]
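
The cell above uses only `if`. As a minimal sketch of how `elif` and `else` fit together with it, the code below labels each value of the same list; the labels and the name `labels` are hypothetical.

```python
mylist = [0, 20, 30, 50]

labels = []
for val in mylist:
    if val < 30:                  # checked first
        labels.append("low")
    elif val == 30:               # checked only if the `if` condition was false
        labels.append("equal")
    else:                         # runs when none of the conditions above were true
        labels.append("high")

print(labels)  # ['low', 'low', 'equal', 'high']
```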

**Tables and the common operations on them**

* Pandas is a table manipulation package
* Pandas dataframe anatomy:
    * Dataframe: refers to the table object
    * Row: each row of the table
    * Column: each column in the table
    * Datatype: each column has a data type
        * Data type is inferred from the data and can be changed
        

In [None]:
import pandas as pd
# We will create an example table containing various data types
mydict = {'name':['John','Joe'], 'age':[20,21],
          'email_id':['john@companyemail.com','joe@companyemail.com']} # define a dictionary of values
df = pd.DataFrame(mydict) # define the dataframe
print(df.dtypes) # print data type
df # print the values of the dataframe

name        object
age          int64
email_id    object
dtype: object


   name  age               email_id
0  John   20  john@companyemail.com
1   Joe   21   joe@companyemail.com
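
Each column has a data type that can be changed. One common way to do this is the `astype` method, sketched below on a reduced version of the table; converting `age` to a floating-point column is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Joe"], "age": [20, 21]})
print(df.dtypes)                       # age is inferred as an integer column

df["age"] = df["age"].astype(float)    # explicitly convert the column to floating point
print(df.dtypes)                       # age is now a float column
```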


Example of grouping in Pandas

In [None]:
import pandas as pd  # Once imported, a package does not need to be imported again in the same session.

mydict = {'name':['John','Joe','Emma'], 'age':[20,21,20],
          'email_id':['john@companyemail.com','joe@companyemail.com','emma@companyemail.com']} # define a dictionary of values
df = pd.DataFrame(mydict) # define the dataframe
print(df)
# group data by age and count the number of people in each age group
df_group = df.groupby('age')['name'].count()
print("\nGroup by age:")
print(df_group) # print the result

   name  age               email_id
0  John   20  john@companyemail.com
1   Joe   21   joe@companyemail.com
2  Emma   20  emma@companyemail.com

Group by age:
age
20    2
21    1
Name: name, dtype: int64


Example of sorting a dataframe

In [None]:
import pandas as pd  # Once imported, a package does not need to be imported again in the same session.

mydict = {'name':['John','Joe','Emma'], 'age':[20,21,20],
          'email_id':['john@companyemail.com','joe@companyemail.com','emma@companyemail.com']} # define a dictionary of values
df = pd.DataFrame(mydict) # define the dataframe
# Sort the dataframe by age
df_sorted = df.sort_values('age') # the relative order of rows with equal age is not guaranteed by default
print(df_sorted) # print the result

   name  age               email_id
0  John   20  john@companyemail.com
2  Emma   20  emma@companyemail.com
1   Joe   21   joe@companyemail.com
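
When several rows share the same age, it can help to control how the ties are ordered. The sketch below shows two options that pandas provides: sorting by a second column, and requesting a stable sort with `kind="stable"`. The variable names `df_sorted` and `df_stable` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Joe", "Emma"], "age": [20, 21, 20]})

# Sort by age first, then by name to break ties between equal ages
df_sorted = df.sort_values(["age", "name"])
print(df_sorted)

# Alternatively, a stable sort keeps tied rows in their original order
df_stable = df.sort_values("age", kind="stable")
print(df_stable)
```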


The merge operation combines two dataframes by matching rows on a shared key column.
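
A minimal sketch of a merge, assuming two small tables that share the key column `name`. The table names `df_people` and `df_salary` are hypothetical, and `how="inner"` keeps only the names that appear in both tables.

```python
import pandas as pd

# Two illustrative tables that share the key column `name`
df_people = pd.DataFrame({"name": ["John", "Joe", "Emma"], "age": [20, 21, 20]})
df_salary = pd.DataFrame({"name": ["John", "Joe", "Emma"],
                          "salary_per_year": [56000, 54000, 55000]})

# Combine rows that have the same value in `name`
df_merged = pd.merge(df_people, df_salary, on="name", how="inner")
print(df_merged)
```

The result has one row per name, with the columns of both tables side by side.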

**Functions**

* Functions define a set of commands that are executed whenever the function is called
* Functions reduce clutter in your code and are useful for applying the same steps in a loop or to multiple rows of a table
* They help reduce bugs by defining common code once

In [None]:
import pandas as pd
# define an example dataframe
mydict = {'name':['John','Joe','Emma'], 'age':[20,21,20],
          'email_id':['john@companyemail.com','joe@companyemail.com','emma@companyemail.com'],
          'salary_per_year':[56000,54000,55000],
          'dept_score':[0.1,0.12,0.14]} # define a dictionary of values
df = pd.DataFrame(mydict) # define the dataframe

# Create a function to calculate bonus package
def calculate_bonus(df,company_score):
    """Function takes company score
        and department scores to calculate the bonus as a percent of the salary.
        Args:
        df: Dataframe that contains salary information
        company_score: score on how well the company did on achieving its target goals for the year
        Returns:
        Dataframe with bonus calculated for each employee and their salary after bonus
    """
    df['bonus'] = df["salary_per_year"]*df["dept_score"] # department part: salary times department score
    df['bonus'] = df['bonus']+df["salary_per_year"]*company_score # add company part: salary times company score
    df['salary_with_bonus'] = df['bonus'] + df['salary_per_year'] # calculate salary with bonus
    return df
company_score = 0.1 # input the score for the company
calculate_bonus(df=df,company_score=company_score)



Output of the function that calculates the bonus:


   name  age               email_id  salary_per_year  dept_score    bonus  salary_with_bonus
0  John   20  john@companyemail.com            56000        0.10  11200.0            67200.0
1   Joe   21   joe@companyemail.com            54000        0.12  11880.0            65880.0
2  Emma   20  emma@companyemail.com            55000        0.14  13200.0            68200.0


The same company score is applied to every employee. The calculation is written inside a function, and the function returns the resulting dataframe.
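
The same idea also works without a dataframe. The sketch below is a plain-Python variant that applies the bonus formula used in `calculate_bonus` to one employee at a time inside a loop; the function name `bonus_for_employee` and the `employees` list are hypothetical.

```python
def bonus_for_employee(salary, dept_score, company_score):
    """Return the bonus for one employee: department part plus company part."""
    return salary * dept_score + salary * company_score

company_score = 0.1  # the same company score is used for everyone
employees = [("John", 56000, 0.1), ("Joe", 54000, 0.12), ("Emma", 55000, 0.14)]

for name, salary, dept_score in employees:
    bonus = bonus_for_employee(salary, dept_score, company_score)
    print(name, round(bonus, 2), round(salary + bonus, 2))
```

Because the formula lives in one place, a change to the bonus rule needs to be made only once.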

**Summary**

Terminology:
* Jupyter notebook or simply notebook: An interactive interface to run Python code. Google Colab notebook is one such interactive interface.
* Python language: An interpreted language, so there is no separate compile step. A popular programming language.
    * We talked about some of the common commands
* Packages: Packages provide additional functionality for a given task. An example is the pandas package.


Python under the hood:
* Variables: Store values in memory.
* Data types: Strings, integers, lists, arrays etc.
* Loops: Used to run repeated code over a list or array or table.
* Conditional statements: Used to select data based on a condition.
* Tables: Rows and columns are manipulated to determine the answer.
    * Merge, sort, group
    * Pandas package is most commonly used to manipulate tables.
* Functions:
    * Reduce clutter
    * Re-usable code is placed in a function (reduces bugs, keeps the code modular)
