# Introduction to Python programming for data analysis

**Getting Started with Python.**
Visit https://docs.python.org/3/ for official documentation.

Python is a popular choice for data analysis for several reasons:

1.   Easy to learn and use: Python's readable syntax makes it accessible to beginners and non-programmers.
2.   Large ecosystem of libraries: Python has a vast collection of libraries and modules that make data analysis tasks easier and more efficient. For example, NumPy and Pandas are two popular libraries that are widely used for numerical computation and data manipulation, respectively. There are also libraries available for data visualization, machine learning, and deep learning, making Python a versatile language for data analysis.
3.   Cross-platform compatibility: Python is a cross-platform language, which means that it can run on multiple operating systems, such as Windows, macOS, and Linux.
4.   Integration with other tools: Python can easily integrate with other tools and platforms, making it a popular choice for data analysis in conjunction with other software. For example, Python can be used with SQL databases, Hadoop clusters, and cloud-based services like Amazon Web Services (AWS) and Google Cloud Platform (GCP).





In [None]:
# My first program

print("Hello World")

Hello World


**Variables and data types in Python.**
Here are some examples of declaring and using variables with different data types in Python:

In [None]:
# Integer variable
x = 10

# Floating-point variable (floats have no fixed number of decimal places)
y = 3.14

# String variable
name = "Govind"
name_2 = 'Prashant'

# Boolean variable: either True or False
is_active = True

# List variable
fruits = ["apple", "banana", "orange"]
students = ["gaurav", 12, "India"]

# Dictionary variable
person = {"name": "Govind", "age": 32}

# Set variable: holds unique (non-repeating) values
my_set = {1, 2, 3}

# Tuple variable: an immutable sequence
my_tuple = (1, 2, 3)
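
To confirm which type Python has assigned to a value, the built-in `type()` function can be used. The sketch below redefines a few of the variables above so that it runs on its own; the `type_names` dictionary is a helper introduced here purely for illustration.

```python
# Redefine a few of the variables from the cell above
x = 10
y = 3.14
name = "Govind"
is_active = True
fruits = ["apple", "banana", "orange"]
person = {"name": "Govind", "age": 32}

# Map each variable name to the name of its type (illustrative helper)
type_names = {
    "x": type(x).__name__,
    "y": type(y).__name__,
    "name": type(name).__name__,
    "is_active": type(is_active).__name__,
    "fruits": type(fruits).__name__,
    "person": type(person).__name__,
}

for label, type_name in type_names.items():
    print(f"{label}: {type_name}")
```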

**Below are some examples of basic arithmetic and mathematical operations in Python:**

In [None]:
# Addition (+): This operator is used to add two or more numbers together.
a = 5
b = 3
c = a + b
print(c)

8


In [None]:
# Subtraction (-): This operator is used to subtract one number from another.
a = 5
b = 3
c = a - b
print(c) 

2


In [None]:
# Multiplication (*): This operator is used to multiply two or more numbers together.
a = 5
b = 3
c = a * b
print(c) 

15


In [None]:
# Division (/): This operator is used to divide one number by another.
a = 5
b = 3
c = a / b
print(c)  

1.6666666666666667


In [None]:
# Floor Division (//): This operator is used to divide one number by another and returns the floor of the result.
a = 5
b = 3
c = a // b
print(c)  

1


In [None]:
# Modulus (%): This operator is used to get the remainder of a division operation.
a = 5
b = 3
c = a % b
print(c) 

2
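
Floor division and modulus can surprise newcomers when an operand is negative: `//` rounds toward negative infinity, and the result of `%` takes the sign of the divisor. A minimal sketch, reusing the names `a` and `b` from the cells above with a negative value chosen here for illustration:

```python
a = -5
b = 3

quotient = a // b    # floor of -1.666..., which is -2 (not -1)
remainder = a % b    # 1, because (-2 * 3) + 1 == -5

print(quotient)
print(remainder)

# divmod() returns both results at once
print(divmod(a, b))

# The two operators always satisfy this identity
print(a == quotient * b + remainder)
```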


In [None]:
# Exponentiation (**): This operator is used to raise a number to a power.
a = 2
b = 3
c = a ** b
print(c)

8


**Python provides built-in data structures: lists, tuples, sets and dictionaries, which can be used to store and manipulate data in various ways.**

In [None]:
# Lists: A list is a mutable collection of values that are ordered and can be accessed by an index. Lists are created using square brackets [].
my_list = [1, 2, 3, 4, 5, 'Apple', 'Ball']
print(my_list)

[1, 2, 3, 4, 5, 'Apple', 'Ball']


In [None]:
# Accessing elements of a list using indexing and slicing
print(my_list[0])   
print(my_list[2:4])
print(my_list[-1])

1
[3, 4]
Ball


In [None]:
# Tuples: A tuple is an immutable collection of values that are ordered and can be accessed by an index. Tuples are created using parentheses ().
my_tuple = (1, 2, 3, 4, 5, 'cat', 'dog')
print(my_tuple)

(1, 2, 3, 4, 5, 'cat', 'dog')


In [None]:
# Accessing elements of a tuple using index and slicing
my_tuple = (1, 2, 3, 4, 5)
print(my_tuple[0])   
print(my_tuple[2:4]) 
print(my_tuple[-1])

1
(3, 4)
5


In [None]:
# Sets: A set is an unordered collection of unique elements. Unlike lists and tuples, sets cannot contain duplicate elements and cannot be accessed by an index.
# Sets are created using curly braces {} or the set() function.
my_set = {1, 2, 3, 4, 5} # using curly braces
print(my_set)   

another_set = set([4, 5, 6, 7, 8]) # using set() function
print(another_set)

{1, 2, 3, 4, 5}
{4, 5, 6, 7, 8}
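
The uniqueness property and the usual set operations can be sketched as follows. The two sets are the ones defined above; the names `unique_values`, `common`, `combined` and `only_in_first` are introduced here for illustration.

```python
my_set = {1, 2, 3, 4, 5}
another_set = set([4, 5, 6, 7, 8])

# Duplicates are dropped when a set is built from a list
unique_values = set([1, 2, 2, 3, 3, 3])

common = my_set & another_set          # intersection: in both sets
combined = my_set | another_set        # union: in either set
only_in_first = my_set - another_set   # difference: in my_set only

# sorted() gives a predictable display order, since sets are unordered
print(sorted(unique_values))
print(sorted(common))
print(sorted(combined))
print(sorted(only_in_first))
```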


In [None]:
# Dictionaries: A dictionary is a mutable collection of key-value pairs that are accessed by key rather than by position. Since Python 3.7, dictionaries preserve insertion order. Dictionaries are created using curly braces {}.
my_dict = {'name': 'Aman', 'age': 35, 'city': 'Mumbai'}
print(my_dict)

{'name': 'Aman', 'age': 35, 'city': 'Mumbai'}


In [None]:
# Accessing elements of a dictionary
print(my_dict['name'])
print(my_dict['age'])

Aman
35
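
The difference between mutable and immutable structures can be sketched as below. The values mirror the cells above; the `'country'` and `'email'` keys and the `tuple_changed` flag are assumptions added for illustration.

```python
my_list = [1, 2, 3]
my_list.append(4)      # lists can grow in place
my_list[0] = 10        # and individual items can be reassigned

my_dict = {'name': 'Aman', 'age': 35, 'city': 'Mumbai'}
my_dict['age'] = 36            # update an existing key
my_dict['country'] = 'India'   # add a new key (illustrative)

# .get() returns a default instead of raising KeyError for a missing key
email = my_dict.get('email', 'not provided')

my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 10           # tuples do not support item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(my_list)
print(my_dict)
print(email)
print(tuple_changed)
```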


# Data manipulation with pandas library

**Data Manipulation with Pandas.** Visit https://pandas.pydata.org/docs/ for official documentation.

*   Pandas provides two primary data structures for working with data: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type, including integers, floats, and strings. A DataFrame is a two-dimensional table that consists of rows and columns. It can be thought of as a collection of Series objects, with each column representing a Series.
*   Pandas provides a wide range of functions for manipulating and analyzing data. These functions include data cleaning, data manipulation, data analysis, and data visualization.
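
The relationship between the two structures can be sketched as follows. The names `ages`, `people` and `age_column` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# A Series: one-dimensional, with an index and an optional name
ages = pd.Series([25, 32, 18], name="Age")

# A DataFrame: two-dimensional, built here from a dict of columns
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 32, 18],
})

# Selecting a single column returns a Series
age_column = people["Age"]

print(type(ages).__name__)
print(type(people).__name__)
print(type(age_column).__name__)
print(people.shape)   # (rows, columns)
```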



In [None]:
import pandas as pd

# read data from CSV file
df = pd.read_csv('filename.csv')

# print the dataframe
print(df)
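
Because `'filename.csv'` above is a placeholder, that cell only runs if such a file exists. A self-contained sketch, assuming a small made-up CSV held in memory via `io.StringIO` (the name `sample_df` and the data are illustrative), shows the same call along with a few common first checks:

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file
csv_text = """Name,Age
Alice,25
Bob,32
"""

# read_csv accepts a file path, a URL, or a file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head())   # first rows
print(sample_df.shape)    # (rows, columns)
print(sample_df.dtypes)   # inferred column types
```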

In [None]:
import pandas as pd

# read data from Excel file
df = pd.read_excel('filename.xlsx')

# print the dataframe
print(df)

In [None]:
import pandas as pd
import sqlite3

# create a database connection
conn = sqlite3.connect('database.db')

# read data from SQL query
df = pd.read_sql_query('SELECT * FROM table_name', conn)

# print the dataframe
print(df)

In [None]:
import pandas as pd

# read data from URL
url = 'https://example.com/data.csv'
df = pd.read_csv(url)

# print the dataframe
print(df)

In [2]:
import pandas as pd
import numpy as np

# create sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'Age': [25, 32, 18, 47],
    'Gender': ['Female', 'Male', 'Male', 'Male'],
    'Height': [165, 180, 173, 190],
    'Weight': [55, 70, 68, 90]
}

# convert data to dataframe
df = pd.DataFrame(data)

# print the dataframe
print(df)


      Name  Age  Gender  Height  Weight
0    Alice   25  Female     165      55
1      Bob   32    Male     180      70
2  Charlie   18    Male     173      68
3     Dave   47    Male     190      90


Below are some common methods for filtering and selecting data using Pandas.

In [3]:
# Selecting columns: To select one or more columns from a dataframe, you can use the indexing operator [] with the column names as a list.
df[['Name', 'Age']]

      Name  Age
0    Alice   25
1      Bob   32
2  Charlie   18
3     Dave   47


In [4]:
""" Filtering rows: You can filter rows based on certain conditions using the loc and iloc methods. 
loc selects by label (and also accepts a boolean mask, as used below), while iloc selects by integer position."""
# select all rows where the 'Age' column is greater than 25
df.loc[df['Age'] > 25]

   Name  Age  Gender  Height  Weight
1   Bob   32    Male     180      70
3  Dave   47    Male     190      90
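
The difference between label-based and position-based selection is easier to see when the index labels are not simply 0, 1, 2, and so on. A minimal sketch, assuming a small frame named `people` with made-up string labels `'a'` to `'d'`:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Dave"],
        "Age": [25, 32, 18, 47],
    },
    index=["a", "b", "c", "d"],   # illustrative string labels
)

by_label = people.loc["b", "Name"]    # row labelled 'b', column 'Name'
by_position = people.iloc[1, 0]       # second row, first column

# Label slices include the end label; position slices exclude the end
label_slice = people.loc["a":"c"]     # rows a, b, c
position_slice = people.iloc[0:2]     # rows at positions 0 and 1

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```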


In [5]:
# Combining conditions: we can combine multiple conditions using logical operators such as & (and) and | (or).
# select all rows where the 'Age' column is greater than 25 and the 'Gender' column is 'Male'
df.loc[(df['Age'] > 25) & (df['Gender'] == 'Male')]

   Name  Age  Gender  Height  Weight
1   Bob   32    Male     180      70
3  Dave   47    Male     190      90


In [6]:
# Using "query" method: Pandas provides a query method to select data based on a query expression.
df.query('Age > 25 and Gender == "Male"')

   Name  Age  Gender  Height  Weight
1   Bob   32    Male     180      70
3  Dave   47    Male     190      90
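
A query expression can also refer to a Python variable by prefixing its name with `@`, which avoids building the query string by hand. A sketch using the same sample data; the names `people`, `min_age` and `result` are assumptions for illustration:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 32, 18, 47],
    "Gender": ["Female", "Male", "Male", "Male"],
})

min_age = 25   # threshold kept in an ordinary Python variable

# @min_age is resolved from the surrounding Python scope
result = people.query('Age > @min_age and Gender == "Male"')
print(result)
```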


In [7]:
# Using "isin" method: You can select rows based on whether a value is contained in a list using the isin method.
# Select all rows where the 'Gender' column is 'Male'
df.loc[df['Gender'].isin(['Male'])]

      Name  Age  Gender  Height  Weight
1      Bob   32    Male     180      70
2  Charlie   18    Male     173      68
3     Dave   47    Male     190      90


In [8]:
# Using "between" method: You can select rows based on a range of values using the between method.
# Select all rows where the 'Age' column is between 25 and 35 (both bounds are included by default)
df.loc[df['Age'].between(25, 35)]

    Name  Age  Gender  Height  Weight
0  Alice   25  Female     165      55
1    Bob   32    Male     180      70
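
Alice (age 25) appears in the result because `between` includes both boundaries by default. The sketch below, using the same sample ages in a frame named `people` (an illustrative name), checks that the default behaviour matches an explicit pair of comparisons:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 32, 18, 47],
})

mask_between = people["Age"].between(25, 35)
mask_manual = (people["Age"] >= 25) & (people["Age"] <= 35)

# Both masks select the same rows
print(mask_between.equals(mask_manual))
print(people.loc[mask_between, "Name"].tolist())
```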


In [9]:
# Using "nlargest" and "nsmallest" methods: You can select the n largest or n smallest values in a column using the nlargest and nsmallest methods.
# Select the 2 rows with the highest values in the 'Age' column
df.nlargest(2, 'Age')


   Name  Age  Gender  Height  Weight
3  Dave   47    Male     190      90
1   Bob   32    Male     180      70
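
The cell above shows only `nlargest`. A sketch of its counterpart `nsmallest`, together with an equivalent formulation using `sort_values` and `head`, assuming the same sample ages in a frame named `people` (illustrative):

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 32, 18, 47],
})

# The 2 rows with the lowest values in the 'Age' column
youngest = people.nsmallest(2, "Age")

# The 2 oldest rows, obtained by sorting and taking the first two
oldest = people.sort_values("Age", ascending=False).head(2)

print(youngest)
print(oldest)
```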


In [13]:
# Using "groupby" method: You can group data by one or more columns and apply aggregate functions to the groups using the groupby method.
# Group the data by the 'Gender' column and calculate the average age for each group.
df.groupby('Gender')['Age'].mean()
# For comparison, df['Age'].mean() gives the overall average age.

Gender
Female    25.000000
Male      32.333333
Name: Age, dtype: float64

Below are some of the most useful pandas aggregating methods. These methods can be applied to a group of data in a pandas DataFrame or Series using the groupby() method, which groups the data based on one or more columns.

* sum(): Computes the sum of values in a group.
* mean(): Computes the mean of values in a group.
* median(): Computes the median of values in a group.
* min(): Computes the minimum value in a group.
* max(): Computes the maximum value in a group.
* count(): Counts the number of non-missing values in a group.
* std(): Computes the standard deviation of values in a group.
* var(): Computes the variance of values in a group.
* describe(): Generates descriptive statistics of values in a group.
* quantile(): Computes the quantile of values in a group.
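
Several of these aggregations can be computed in one pass with `agg()`, and `quantile()` takes the desired quantile as an argument. A minimal sketch, assuming a small made-up frame named `staff`; the names `summary` and `medians` are also illustrative:

```python
import pandas as pd

staff = pd.DataFrame({
    "Department": ["Sales", "Sales", "Marketing", "Marketing"],
    "Salary": [50000, 70000, 60000, 80000],
})

by_department = staff.groupby("Department")["Salary"]

# One row per department, one column per aggregation
summary = by_department.agg(["min", "max", "mean"])

# The 0.5 quantile is the median
medians = by_department.quantile(0.5)

print(summary)
print(medians)
```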



In [26]:
import pandas as pd

# Create sample data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank', 'Grace', 'Henry', 'Isabella', 'Jack'],
        'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
        'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
        'Salary': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000],
        'Department': ['Sales', 'Marketing', 'Sales', 'Marketing', 'Sales', 'Marketing', 'Sales', 'Marketing', 'Sales', 'Marketing']}

df = pd.DataFrame(data)

# Group by department (the grouped object is reused in the cells below)
grouped = df.groupby('Department')

# Display the full DataFrame
df

       Name  Age  Gender  Salary  Department
0     Alice   25  Female   50000       Sales
1       Bob   30    Male   60000   Marketing
2   Charlie   35    Male   70000       Sales
3     David   40    Male   80000   Marketing
4     Emily   45  Female   90000       Sales
5     Frank   50    Male  100000   Marketing
6     Grace   55  Female  110000       Sales
7     Henry   60    Male  120000   Marketing
8  Isabella   65  Female  130000       Sales
9      Jack   70    Male  140000   Marketing


In [27]:
# Sum of salaries by department
print(grouped['Salary'].sum())

Department
Marketing    500000
Sales        450000
Name: Salary, dtype: int64


In [28]:
# Mean age by department
print(grouped['Age'].mean())

Department
Marketing    50.0
Sales        45.0
Name: Age, dtype: float64


In [16]:
# Median salary by gender
print(df.groupby('Gender')['Salary'].median())

Gender
Female    100000.0
Male       90000.0
Name: Salary, dtype: float64


In [17]:
# Minimum salary by department
print(grouped['Salary'].min())

Department
Marketing    60000
Sales        50000
Name: Salary, dtype: int64


In [18]:
# Maximum age by department and gender
print(df.groupby(['Department', 'Gender'])['Age'].max())

Department  Gender
Marketing   Male      70
Sales       Female    65
            Male      35
Name: Age, dtype: int64


In [19]:
# Count of employees by department
print(grouped.size())

Department
Marketing    5
Sales        5
dtype: int64


In [20]:
# Standard deviation of salaries by department
print(grouped['Salary'].std())

Department
Marketing    31622.776602
Sales        31622.776602
Name: Salary, dtype: float64


In [21]:
# Variance of salaries by gender
print(df.groupby('Gender')['Salary'].var())

Gender
Female    1.166667e+09
Male      9.500000e+08
Name: Salary, dtype: float64


In [22]:
# Descriptive statistics of age by department
print(grouped['Age'].describe())

            count  mean        std   min   25%   50%   75%   max
Department                                                      
Marketing     5.0  50.0  15.811388  30.0  40.0  50.0  60.0  70.0
Sales         5.0  45.0  15.811388  25.0  35.0  45.0  55.0  65.0


In [23]:
# Descriptive statistics of salary by department
print(grouped['Salary'].describe())

            count      mean           std      min      25%       50%  \
Department                                                              
Marketing     5.0  100000.0  31622.776602  60000.0  80000.0  100000.0   
Sales         5.0   90000.0  31622.776602  50000.0  70000.0   90000.0   

                 75%       max  
Department                      
Marketing   120000.0  140000.0  
Sales       110000.0  130000.0  
