# Python Cheatsheet 

## Contents  
1. <a href='#section1'>Syntax and whitespace</a>
2. <a href='#section2'>Comments</a>
3. <a href='#section3'>Numbers and operations</a>
4. <a href='#section4'>String manipulation</a>
5. <a href='#section5'>Lists, tuples, and dictionaries</a>
6. <a href='#section6'>JSON</a>
7. <a href='#section7'>Loops</a>
8. <a href='#section8'>File handling</a>
9. <a href='#section9'>Functions</a>
10. <a href='#section10'>Working with datetime</a>
11. <a href='#section11'>NumPy</a>
12. <a href='#section12'>Pandas</a>

To run a cell, press **Shift+Enter** or click **Run** at the top of the page.

<a id='section1'></a>

## 1. Syntax and whitespace
Python uses indentation to mark the nesting level of statements. In the following cell, '**if**' and '**else**' are at the same level, while each '**print**' call is indented one level deeper because it belongs to the block above it. All statements in the same block must use the same indentation.

In [3]:
student_number = input("Enter your student number:")
if int(student_number) != 0:
    print("Welcome student {}".format(student_number))
else:
    print("Try again!")

Welcome student 23000195
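As a further illustration of indentation levels, the sketch below nests one `if` block inside another. The function name `describe_student` and the threshold value are hypothetical and exist only for this example.

```python
def describe_student(student_number):
    """Return a greeting; each indentation level marks a nested block."""
    if student_number != 0:                  # level 1: inside the function
        if student_number > 20000000:        # level 2: inside the outer if
            return "Welcome, new student {}".format(student_number)
        return "Welcome back, student {}".format(student_number)
    else:
        return "Try again!"

print(describe_student(23000195))
print(describe_student(0))
```

Moving a line to a different indentation level changes which block it belongs to, so inconsistent spacing usually shows up as an `IndentationError` or as logic that runs at the wrong time.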


<a id='section2'></a>

## 2. Comments
In Python, comments start with a hash '#' and extend to the end of the line. The '#' can be at the beginning of the line or after code. A '#' inside a string is not treated as a comment.

In [4]:
# This is code to print hello world!

print("Hello world!") # Print statement for hello world
print("# is not a comment in this case")

Hello world!
# is not a comment in this case


<a id='section3'></a>

## 3. Numbers and operations

Python 3 has three built-in numeric types:
- Integers (e.g., 1, 20, 45, 1000), indicated by *int*. Integers have arbitrary precision, so the separate *long* type from Python 2 no longer exists.
- Floating point numbers (e.g., 1.25, 20.35, 1000.00), indicated by *float*
- Complex numbers (e.g., 3+2j, where 3 is the real part and 2 is the imaginary part), indicated by *complex*

Operation       |      Result
----------------|-------------------------------------
x + y           |      Sum of x and y
x - y           |      Difference of x and y
x * y           |      Product of x and y
x / y           |      Quotient of x and y (always a float)
x // y          |      Quotient of x and y, rounded down (floor division)
x % y           |      Remainder of x / y
abs(x)          |      Absolute value of x
int(x)          |      x converted to integer
float(x)        |      x converted to floating point
pow(x, y)       |      x to the power y
x ** y          |      x to the power y

In [5]:
# Number examples
a = 5 + 8
print("Sum of int numbers: {} and number format is {}".format(a, type(a)))

b = 5 + 2.3
print("Sum of int and float numbers: {} and number format is {}".format(b, type(b)))

Sum of int numbers: 13 and number format is <class 'int'>
Sum of int and float numbers: 7.3 and number format is <class 'float'>
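The operators in the table can be tried out directly. The following is a small sketch of the division operators and of the complex type; the variable names `big` and `z` are chosen only for this illustration.

```python
# True division always returns a float; floor division rounds down.
print(7 / 2)         # 3.5
print(7 // 2)        # 3
print(-7 // 2)       # -4 (rounds toward negative infinity)
print(7 % 2)         # 1
print(divmod(7, 2))  # (3, 1): quotient and remainder together

# Integers have arbitrary precision in Python 3.
big = 2 ** 100
print(big)

# Complex numbers use the j suffix for the imaginary part.
z = 3 + 2j
print(z.real, z.imag)
print(abs(3 + 4j))   # Magnitude of a complex number
```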


<a id='section4'></a>

## 4. String manipulation

Python strings come with many built-in methods for searching, transforming, and joining text.

In [6]:
# Store strings in a variable
test_word = "hello world to everyone"

# Print the test_word value
print(test_word)

# Use [] to access a single character of the string by index. Indexing starts at 0.
print(test_word[0])

# Use the len() function to find the length of the string
print(len(test_word))

# Some examples of finding in strings
print(test_word.count('l')) # Count number of times l repeats in the string
print(test_word.find("o")) # Find letter 'o' in the string. Returns the position of first match.
print(test_word.count(' ')) # Count number of spaces in the string
print(test_word.upper()) # Change the string to uppercase
print(test_word.lower()) # Change the string to lowercase
print(test_word.replace("everyone","you")) # Replace word "everyone" with "you"
print(test_word.title()) # Change string to title format
print(test_word + "!!!") # Concatenate strings
print(":".join(test_word)) # Add ":" between each character
print("".join(reversed(test_word))) # Reverse the string 

hello world to everyone
h
23
3
4
3
HELLO WORLD TO EVERYONE
hello world to everyone
hello world to you
Hello World To Everyone
hello world to everyone!!!
h:e:l:l:o: :w:o:r:l:d: :t:o: :e:v:e:r:y:o:n:e
enoyreve ot dlrow olleh
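Slicing, splitting, and f-strings are also common in day-to-day string handling. The sketch below reuses the same `test_word` value; the variable name `words` is introduced only for this example.

```python
test_word = "hello world to everyone"

print(test_word[0:5])        # Slice: characters at index 0 through 4
print(test_word[-8:])        # Negative index counts from the end
words = test_word.split()    # Split on whitespace into a list of words
print(words)
print(f"{len(words)} words, first is {words[0]!r}")  # f-string formatting
print("world" in test_word)  # Substring membership test
```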


<a id='section5'></a>

## 5. Lists, tuples, and dictionaries

Python's built-in collection types include lists, tuples, and dictionaries.

### Lists

A list is created by placing all the items (elements) inside square brackets \[ ] separated by commas. A list can have any number of items, and they may be of different types (integer, float, strings, etc.).

In [7]:
# A Python list is similar to an array. You can create an empty list too.

my_list = []

first_list = [3, 5, 7, 10]
second_list = [1, 'python', 3]

In [8]:
# Nest multiple lists
nested_list = [first_list, second_list]
nested_list

[[3, 5, 7, 10], [1, 'python', 3]]

In [9]:
# Combine multiple lists
combined_list = first_list + second_list
combined_list

[3, 5, 7, 10, 1, 'python', 3]

In [10]:
# You can slice a list, just like strings
combined_list[0:3]

[3, 5, 7]

In [11]:
# Append a new entry to the list
combined_list.append(600)
combined_list

[3, 5, 7, 10, 1, 'python', 3, 600]

In [12]:
# Remove the last entry from the list
combined_list.pop()

600

In [13]:
# Iterate the list
for item in combined_list:
    print(item)    

3
5
7
10
1
python
3


### Tuples

A tuple is similar to a list, but it is written with parentheses ( ) instead of square brackets. The main difference is that a tuple is immutable (its items cannot be changed after creation), while a list is mutable.

In [14]:
my_tuple = (1, 2, 3, 4, 5)
my_tuple[1:4]

(2, 3, 4)
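The difference between mutable and immutable types can be seen by trying to change an item. This is a minimal sketch; the names `my_list`, `first`, and `rest` are chosen only for illustration.

```python
my_tuple = (1, 2, 3, 4, 5)
my_list = list(my_tuple)   # Copy the items into a list, which can be changed
my_list[0] = 100
print(my_list)

try:
    my_tuple[0] = 100      # Item assignment on a tuple raises TypeError
except TypeError as error:
    print("Cannot modify tuple:", error)

first, *rest = my_tuple    # Tuple unpacking: first item, then the remainder
print(first, rest)
```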

### Dictionaries

A dictionary is also known as an associative array. A dictionary consists of a collection of key-value pairs. Each key-value pair maps the key to its associated value.

In [15]:
desk_location = {'jack': 123, 'joe': 234, 'hary': 543}
desk_location['jack']

123
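Beyond looking up a value by key, dictionaries support adding, updating, removing, and iterating over pairs. In the sketch below, the keys `'mary'` and `'alice'` and their values are made up for the example.

```python
desk_location = {'jack': 123, 'joe': 234, 'hary': 543}

desk_location['mary'] = 321                   # Add a new key-value pair
desk_location['joe'] = 235                    # Update an existing value
print(desk_location.get('alice', 'unknown'))  # Default instead of a KeyError
del desk_location['hary']                     # Remove a pair

for name, desk in desk_location.items():      # Iterate over key-value pairs
    print(name, desk)
```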

<a id='section6'></a>

## 6. JSON 

JSON (JavaScript Object Notation) is a text format for exchanging structured data. Python has a built-in package called `json` that can be used to work with JSON data.

In [16]:
import json

# Sample JSON data
x = '{"first_name":"Jane", "last_name":"Doe", "age":25, "city":"Chicago"}'

# Read JSON data
y = json.loads(x)

# Print the output; json.loads() returns a Python dictionary
print("Employee name is "+ y["first_name"] + " " + y["last_name"])

Employee name is Jane Doe
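The reverse direction, from a Python object back to JSON text, uses `json.dumps()`. The sketch below is a simple round trip; the `skills` field is a hypothetical addition for illustration.

```python
import json

x = '{"first_name":"Jane", "last_name":"Doe", "age":25, "city":"Chicago"}'
y = json.loads(x)                  # JSON text -> Python dictionary

y["age"] += 1                      # Work with it like any dictionary
y["skills"] = ["python", "sql"]    # Hypothetical field added for the example

z = json.dumps(y, indent=2, sort_keys=True)  # Python dictionary -> JSON text
print(z)
```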


<a id='section7'></a>

## 7. Loops
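**if, elif, else statements:** Python supports conditional statements like any other programming language. Strictly speaking, these are branches rather than loops, but they are often used together with loops. Python relies on indentation (whitespace at the beginning of the line) to define the scope of the code.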

In [17]:
a = 22
b = 33
c = 100

# if ... else example
if a > b:
    print("a is greater than b")
else:
    print("b is greater than a")
    
    
# if ... elif ... else example

if a > b:
    print("a is greater than b")
elif b > c:
    print("b is greater than c")
else:
    print("b is greater than a and c is greater than b")

b is greater than a
b is greater than a and c is greater than b


**While loop:** Runs a set of statements as long as the condition is true

In [18]:
# Sample while example
i = 1
while i < 10:
    print("count is " + str(i))
    i += 1

print("="*10)

# Continue to next iteration if x is 2. Finally, print message once the condition is false.

x = 0
while x < 5:
    x += 1
    if x == 2:
        continue
    print(x)
else:
    print("x is no longer less than 5")

count is 1
count is 2
count is 3
count is 4
count is 5
count is 6
count is 7
count is 8
count is 9
1
3
4
5
x is no longer less than 5
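The `else` branch of a `while` loop runs only when the condition becomes false. As a small sketch of the contrast, leaving the loop with `break` skips the `else` block entirely:

```python
x = 0
while x < 5:
    x += 1
    if x == 3:
        break          # Leaving via break skips the else block
else:
    print("loop finished without break")

print("stopped at", x)
```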


**For loop:** A `for` loop in Python works like an iterator: it steps through the items of a sequence or other iterable (list, tuple, dictionary, set, string, or range).

In [19]:
# Sample for loop examples
fruits = ["orange", "banana", "apple", "grape", "cherry"]
for fruit in fruits:
    print(fruit)

print("\n")
print("="*10)
print("\n")

# Iterating range
for x in range(1, 10, 2):
    print(x)
else:
    print("task complete")

print("\n")
print("="*10)
print("\n")

# Iterating multiple lists
traffic_lights = ["red", "yellow", "green"]
action = ["stop", "slow down", "go"]

for light in traffic_lights:
    for task in action:
        print(light, task)

orange
banana
apple
grape
cherry




1
3
5
7
9
task complete




red stop
red slow down
red go
yellow stop
yellow slow down
yellow go
green stop
green slow down
green go
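The nested loop above prints every combination of light and action. If the intent is to pair each light with its own action, `zip()` is one way to do that, and `enumerate()` adds a running index. This is a sketch using the same two lists; the name `pairs` is introduced only for the example.

```python
traffic_lights = ["red", "yellow", "green"]
action = ["stop", "slow down", "go"]

# zip() walks both lists in step, yielding one pair per position
pairs = list(zip(traffic_lights, action))
for light, task in pairs:
    print(light, task)

# enumerate() supplies a running index alongside each item
for position, light in enumerate(traffic_lights, start=1):
    print(position, light)
```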


<a id='section8'></a>

## 8. File handling
The key function for working with files in Python is the `open()` function. Its two most commonly used parameters are the filename and the mode.

There are four different modes for opening a file:

- "r" - Read
- "a" - Append
- "w" - Write
- "x" - Create

In addition, you can specify if the file should be handled in binary or text mode.

- "t" - Text
- "b" - Binary

In [20]:
# Let's create a test text file
!echo "This is a test file with text in it. This is the first line." > test.txt
!echo "This is the second line." >> test.txt
!echo "This is the third line." >> test.txt

In [21]:
# Read file
file = open('test.txt', 'r')
print(file.read())
file.close()

print("\n")
print("="*10)
print("\n")

# Read first 10 characters of the file
file = open('test.txt', 'r')
print(file.read(10))
file.close()

print("\n")
print("="*10)
print("\n")

# Read line from the file

file = open('test.txt', 'r')
print(file.readline())
file.close()

"This is a test file with text in it. This is the first line." 
"This is the second line." 
"This is the third line." 





"This is a




"This is a test file with text in it. This is the first line." 



In [22]:
# Create new file

file = open('test2.txt', 'w')
file.write("This is content in the new test2 file.")
file.close()

# Read the content of the new file
file = open('test2.txt', 'r')
print(file.read())
file.close()

This is content in the new test2 file.


In [23]:
# Update file
file = open('test2.txt', 'a')
file.write("\nThis is additional content in the new file.")
file.close()

# Read the content of the new file
file = open('test2.txt', 'r')
print(file.read())
file.close()

This is content in the new test2 file.
This is additional content in the new file.
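The cells above call `close()` explicitly. A common alternative is the `with` statement, which closes the file automatically even if an error occurs. The sketch below writes to a temporary directory so that no existing file is touched; the file name `demo.txt` is hypothetical.

```python
import os
import tempfile

# Work in a scratch directory that is deleted when the block ends
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")   # Hypothetical file name

    with open(path, "w") as file:             # Closed automatically on exit
        file.write("first line\n")
        file.write("second line\n")

    with open(path, "r") as file:
        # Iterating over a file object yields one line at a time
        lines = [line.rstrip("\n") for line in file]

print(lines)
```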


In [24]:
# Delete file
import os
file_names = ["test.txt", "test2.txt"]
for item in file_names:
    if os.path.exists(item):
        os.remove(item)
        print(f"File {item} removed successfully!")
    else:
        print(f"{item} file does not exist.")

File test.txt removed successfully!
File test2.txt removed successfully!


<a id='section9'></a>

## 9. Functions

A function is a block of code that runs when it is called. You can pass data, or *parameters*, into the function. In Python, a function is defined by `def`.

In [25]:
# Defining a function
def new_funct():
    print("A simple function")

# Calling the function
new_funct()

A simple function


In [26]:
# Sample function with parameters

def param_funct(first_name):
    print(f"Employee name is {first_name}.")

param_funct("Harry")
param_funct("Larry")
param_funct("Shally")

Employee name is Harry.
Employee name is Larry.
Employee name is Shally.
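Functions can also return a value and declare default values for parameters. The following is a minimal sketch; the function `employee_label` and the department names are invented for illustration.

```python
def employee_label(first_name, department="Sales"):
    """Build a label; `department` falls back to a default value."""
    return f"{first_name} ({department})"

print(employee_label("Harry"))                        # Uses the default
print(employee_label("Larry", department="Finance"))  # Overrides the default
```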


**Anonymous functions (lambda):** A lambda is a small anonymous function. It can take any number of arguments, but its body must be a single expression.

In [27]:
# Sample lambda example
x = lambda y: y + 100
print(x(15))

print("\n")
print("="*10)
print("\n")

x = lambda a, b: a*b/100
print(x(2,4))

115




0.08
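Lambdas are most often passed directly to another function rather than assigned to a name. The sketch below uses one as a sort key and one with `map()`; the ages in `employees` are made-up sample data.

```python
employees = [("Harry", 52), ("Larry", 34), ("Shally", 41)]  # Made-up ages

# Sort by the second item of each tuple
by_age = sorted(employees, key=lambda person: person[1])
print(by_age)

# Apply a small function to every item of a list
doubled = list(map(lambda n: n * 2, [1, 2, 3]))
print(doubled)
```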


<a id='section10'></a>

## 10. Working with datetime 

The `datetime` module in Python can be used to work with dates and times.

In [28]:
import datetime

x = datetime.datetime.now()

print(x)
print(x.year)
print(x.strftime("%A"))
print(x.strftime("%B"))
print(x.strftime("%d"))
print(x.strftime("%H:%M:%S %p"))

2024-10-11 10:27:09.595768
2024
Friday
October
11
10:27:09 AM
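Besides formatting with `strftime()`, the module can parse text into a date with `strptime()` and do date arithmetic with `timedelta`. This is a small sketch; the fixed timestamp is chosen only so that the result is reproducible.

```python
import datetime

# Parse a string into a datetime object using a format pattern
start = datetime.datetime.strptime("2024-10-11 10:27:09", "%Y-%m-%d %H:%M:%S")

# Add a time span to a datetime
later = start + datetime.timedelta(days=7, hours=2)

print(later)
print(later.strftime("%A"))               # Day name of the new date
print((later - start).total_seconds())    # Difference in seconds
```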


<a id='section11'></a>

## 11. NumPy

NumPy is the fundamental package for scientific computing with Python. Among other things, it contains:

- Powerful N-dimensional array object
- Sophisticated (broadcasting) functions
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra, Fourier transform, and random number capabilities

In [29]:
# Install NumPy using pip
!pip install --upgrade pip
!pip install numpy



In [30]:
# Import NumPy module
import numpy as np

### Inspecting your array

In [31]:
# Create array
a = np.arange(15).reshape(3, 5) # Create array with range 0-14 in 3 by 5 dimension
b = np.zeros((3,5)) # Create array with zeroes
c = np.ones((2, 3, 4), dtype=np.int16) # Create array of ones with an explicit data type
d = np.ones((3,5))

In [32]:
a.shape # Array dimension

(3, 5)

In [33]:
len(b) # Length of the first dimension (number of rows)

3

In [34]:
c.ndim # Number of array dimensions

3

In [35]:
a.size # Number of array elements

15

In [36]:
b.dtype # Data type of array elements

dtype('float64')

In [37]:
c.dtype.name # Name of data type

'int16'

In [38]:
c.astype(float) # Convert an array type to a different type

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

### Basic math operations

In [39]:
# Create array
a = np.arange(15).reshape(3, 5) # Create array with range 0-14 in 3 by 5 dimension
b = np.zeros((3,5)) # Create array with zeroes
c = np.ones((2, 3, 4), dtype=np.int16) # Create array of ones with an explicit data type
d = np.ones((3,5))

In [40]:
np.add(a,b) # Addition

array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.]])

In [41]:
np.subtract(a,b) # Subtraction

array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.]])

In [42]:
np.divide(a,d) # Division

array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.]])

In [43]:
np.multiply(a,d) # Multiplication

array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.]])

In [44]:
np.array_equal(a,b) # Array-wise comparison: True only if shape and all elements match

False

### Aggregate functions

In [45]:
# Create array
a = np.arange(15).reshape(3, 5) # Create array with range 0-14 in 3 by 5 dimension
b = np.zeros((3,5)) # Create array with zeroes
c = np.ones((2, 3, 4), dtype=np.int16) # Create array of ones with an explicit data type
d = np.ones((3,5))

In [46]:
a.sum() # Array-wise sum

np.int64(105)

In [47]:
a.min() # Array-wise min value

np.int64(0)

In [48]:
a.mean() # Array-wise mean

np.float64(7.0)

In [49]:
a.max(axis=0) # Max value of each column (computed down the rows)

array([10, 11, 12, 13, 14])

In [50]:
np.std(a) # Standard deviation

np.float64(4.320493798938574)
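The `axis` argument decides which dimension an aggregate collapses. The sketch below contrasts `axis=0` with `axis=1` on the same array and then uses the per-column result in a broadcast subtraction; the names `col_max`, `row_max`, and `centered` are introduced only for this example.

```python
import numpy as np

a = np.arange(15).reshape(3, 5)

col_max = a.max(axis=0)   # Collapse the rows: one value per column
row_max = a.max(axis=1)   # Collapse the columns: one value per row
print(col_max)
print(row_max)

# Broadcasting: the 1-D array of column means is applied to each of the 3 rows
centered = a - a.mean(axis=0)
print(centered)
```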

### Subsetting, slicing, and indexing

In [51]:
# Create array
a = np.arange(15).reshape(3, 5) # Create array with range 0-14 in 3 by 5 dimension
b = np.zeros((3,5)) # Create array with zeroes
c = np.ones((2, 3, 4), dtype=np.int16) # Create array of ones with an explicit data type
d = np.ones((3,5))

In [52]:
a[1,2] # Select element of row 1 and column 2

np.int64(7)

In [53]:
a[0:2] # Select rows at index 0 and 1

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [54]:
a[:1] # Select all items at row 0

array([[0, 1, 2, 3, 4]])

In [55]:
a[-1:] # Select all items from last row

array([[10, 11, 12, 13, 14]])

In [56]:
a[a<2] # Select elements from 'a' that are less than 2

array([0, 1])

### Array manipulation

In [57]:
# Create array
a = np.arange(15).reshape(3, 5) # Create array with range 0-14 in 3 by 5 dimension
b = np.zeros((3,5)) # Create array with zeroes
c = np.ones((2, 3, 4), dtype=np.int16) # Create array of ones with an explicit data type
d = np.ones((3,5))

In [58]:
np.transpose(a) # Transpose array 'a'

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

In [59]:
a.ravel() # Flatten the array

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [60]:
a.reshape(5, -1) # Reshape to 5 rows without changing the data; -1 lets NumPy infer the column count

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [61]:
np.append(a,b) # Append items to the array

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.])

In [62]:
np.concatenate((a,d), axis=0) # Concatenate arrays

array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [63]:
np.vsplit(a,3) # Split array vertically (row-wise) into 3 equal parts

[array([[0, 1, 2, 3, 4]]),
 array([[5, 6, 7, 8, 9]]),
 array([[10, 11, 12, 13, 14]])]

In [64]:
np.hsplit(a,5) # Split array horizontally (column-wise) into 5 equal parts

[array([[ 0],
        [ 5],
        [10]]),
 array([[ 1],
        [ 6],
        [11]]),
 array([[ 2],
        [ 7],
        [12]]),
 array([[ 3],
        [ 8],
        [13]]),
 array([[ 4],
        [ 9],
        [14]])]

<a id='section12'></a>

## 12. Pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas DataFrames are the most widely used in-memory representation of complex data collections within Python.

In [65]:
# Install pandas, xlrd, and openpyxl using pip
!pip install pandas
!pip install xlrd openpyxl


In [66]:
# Import NumPy and Pandas modules
import numpy as np
import pandas as pd

In [67]:
# Sample dataframe df
df = pd.DataFrame({'num_legs': [2, 4, np.nan, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, np.nan, 1, 8]},
                   index=['falcon', 'dog', 'spider', 'fish'])
df # Display dataframe df

Unnamed: 0,num_legs,num_wings,num_specimen_seen
falcon,2.0,2,10.0
dog,4.0,0,
spider,,0,1.0
fish,0.0,0,8.0


In [68]:
# Another sample dataframe df1 - using NumPy array with datetime index and labeled column
df1 = pd.date_range('20130101', periods=6)
df1 = pd.DataFrame(np.random.randn(6, 4), index=df1, columns=list('ABCD'))
df1 # Display dataframe df1

Unnamed: 0,A,B,C,D
2013-01-01,0.129859,-0.504064,-0.441824,-0.630423
2013-01-02,0.266233,-0.628018,-0.134917,0.039806
2013-01-03,-0.560826,-0.7387,-0.624981,1.849712
2013-01-04,0.400586,0.61566,1.178935,0.107508
2013-01-05,0.137033,0.072273,2.208482,0.171967
2013-01-06,1.015201,-0.962661,0.347015,-1.214867


### Viewing data

In [69]:
df1 = pd.date_range('20130101', periods=6)
df1 = pd.DataFrame(np.random.randn(6, 4), index=df1, columns=list('ABCD'))

In [70]:
df1.head(2) # View top data

Unnamed: 0,A,B,C,D
2013-01-01,-0.251326,-0.585496,0.786763,0.399773
2013-01-02,1.213697,0.893901,-1.372013,0.640286


In [71]:
df1.tail(2) # View bottom data

Unnamed: 0,A,B,C,D
2013-01-05,-1.686517,-1.296137,-0.678734,-0.029172
2013-01-06,-0.372325,-1.406001,0.902337,0.609066


In [72]:
df1.index # Display index column

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [73]:
df1.dtypes # Inspect datatypes

A    float64
B    float64
C    float64
D    float64
dtype: object

In [74]:
df1.describe() # Display quick statistics summary of data

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.096533,-0.670314,0.376967,0.462933
std,0.959445,0.830678,1.135422,0.271594
min,-1.686517,-1.406001,-1.372013,-0.029172
25%,-0.342075,-1.201806,-0.31236,0.407881
50%,-0.022819,-0.814075,0.84455,0.520635
75%,0.28511,-0.616456,1.058457,0.632481
max,1.213697,0.893901,1.512955,0.725437


### Subsetting, slicing, and indexing

In [75]:
df1 = pd.date_range('20130101', periods=6)
df1 = pd.DataFrame(np.random.randn(6, 4), index=df1, columns=list('ABCD'))

In [76]:
df1.T # Transpose data

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,1.323774,-1.173167,2.004768,1.204739,-0.915693,-0.960215
B,-0.74799,0.668436,0.760061,0.923011,-0.657657,-0.303763
C,0.073254,0.374643,0.233664,1.180535,0.277914,0.331497
D,-0.521303,1.182162,0.682977,-0.064529,0.484448,-1.017554


In [77]:
df1.sort_index(axis=1, ascending=False) # Sort by an axis

Unnamed: 0,D,C,B,A
2013-01-01,-0.521303,0.073254,-0.74799,1.323774
2013-01-02,1.182162,0.374643,0.668436,-1.173167
2013-01-03,0.682977,0.233664,0.760061,2.004768
2013-01-04,-0.064529,1.180535,0.923011,1.204739
2013-01-05,0.484448,0.277914,-0.657657,-0.915693
2013-01-06,-1.017554,0.331497,-0.303763,-0.960215


In [78]:
df1.sort_values(by='B') # Sort by values

Unnamed: 0,A,B,C,D
2013-01-01,1.323774,-0.74799,0.073254,-0.521303
2013-01-05,-0.915693,-0.657657,0.277914,0.484448
2013-01-06,-0.960215,-0.303763,0.331497,-1.017554
2013-01-02,-1.173167,0.668436,0.374643,1.182162
2013-01-03,2.004768,0.760061,0.233664,0.682977
2013-01-04,1.204739,0.923011,1.180535,-0.064529


In [79]:
df1['A'] # Select column A

2013-01-01    1.323774
2013-01-02   -1.173167
2013-01-03    2.004768
2013-01-04    1.204739
2013-01-05   -0.915693
2013-01-06   -0.960215
Freq: D, Name: A, dtype: float64

In [80]:
df1[0:3] # Select index 0 to 2

Unnamed: 0,A,B,C,D
2013-01-01,1.323774,-0.74799,0.073254,-0.521303
2013-01-02,-1.173167,0.668436,0.374643,1.182162
2013-01-03,2.004768,0.760061,0.233664,0.682977


In [81]:
df1['20130102':'20130104'] # Select from index matching the values

Unnamed: 0,A,B,C,D
2013-01-02,-1.173167,0.668436,0.374643,1.182162
2013-01-03,2.004768,0.760061,0.233664,0.682977
2013-01-04,1.204739,0.923011,1.180535,-0.064529


In [82]:
df1.loc[:, ['A', 'B']] # Select on a multi-axis by label

Unnamed: 0,A,B
2013-01-01,1.323774,-0.74799
2013-01-02,-1.173167,0.668436
2013-01-03,2.004768,0.760061
2013-01-04,1.204739,0.923011
2013-01-05,-0.915693,-0.657657
2013-01-06,-0.960215,-0.303763


In [83]:
df1.iloc[3] # Select via the position of the passed integers

A    1.204739
B    0.923011
C    1.180535
D   -0.064529
Name: 2013-01-04 00:00:00, dtype: float64

In [84]:
df1[df1 > 0] # Select values from a DataFrame where a boolean condition is met

Unnamed: 0,A,B,C,D
2013-01-01,1.323774,,0.073254,
2013-01-02,,0.668436,0.374643,1.182162
2013-01-03,2.004768,0.760061,0.233664,0.682977
2013-01-04,1.204739,0.923011,1.180535,
2013-01-05,,,0.277914,0.484448
2013-01-06,,,0.331497,


In [85]:
df2 = df1.copy() # Copy the df1 dataset to df2
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three'] # Add column E with value
df2[df2['E'].isin(['two', 'four'])] # Use isin method for filtering

Unnamed: 0,A,B,C,D,E
2013-01-03,2.004768,0.760061,0.233664,0.682977,two
2013-01-05,-0.915693,-0.657657,0.277914,0.484448,four


### Missing data

Pandas primarily uses the value `np.nan` to represent missing data. It is not included in computations by default.

In [86]:
df = pd.DataFrame({'num_legs': [2, 4, np.nan, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, np.nan, 1, 8]},
                   index=['falcon', 'dog', 'spider', 'fish'])

In [87]:
df.dropna(how='any') # Drop any rows that have missing data

Unnamed: 0,num_legs,num_wings,num_specimen_seen
falcon,2.0,2,10.0
fish,0.0,0,8.0


In [88]:
df.dropna(how='any', axis=1) # Drop any columns that have missing data

Unnamed: 0,num_wings
falcon,2
dog,0
spider,0
fish,0


In [89]:
df.fillna(value=5) # Fill missing data with value 5

Unnamed: 0,num_legs,num_wings,num_specimen_seen
falcon,2.0,2,10.0
dog,4.0,0,5.0
spider,5.0,0,1.0
fish,0.0,0,8.0


In [90]:
pd.isna(df) # To get boolean mask where data is missing

Unnamed: 0,num_legs,num_wings,num_specimen_seen
falcon,False,False,False
dog,False,False,True
spider,True,False,False
fish,False,False,False
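To see what "not included in computations" means in practice, the sketch below compares a column mean with and without skipping `NaN`, and counts the missing values per column. It rebuilds the same sample DataFrame so that it can run on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, np.nan, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, np.nan, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

print(df['num_legs'].mean())               # NaN is skipped: (2 + 4 + 0) / 3
print(df['num_legs'].mean(skipna=False))   # NaN propagates into the result
print(df.isna().sum())                     # Number of missing values per column
```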


### File handling

In [91]:
df = pd.DataFrame({'num_legs': [2, 4, np.nan, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, np.nan, 1, 8]},
                   index=['falcon', 'dog', 'spider', 'fish'])

In [92]:
df.to_csv('foo.csv') # Write to CSV file

In [93]:
pd.read_csv('foo.csv') # Read from CSV file

Unnamed: 0.1,Unnamed: 0,num_legs,num_wings,num_specimen_seen
0,falcon,2.0,2,10.0
1,dog,4.0,0,
2,spider,,0,1.0
3,fish,0.0,0,8.0
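The `Unnamed: 0` column in the output above appears because `read_csv` treats the saved index as an ordinary column. One way to restore it, sketched here, is to pass `index_col=0`. The example writes to an in-memory `io.StringIO` buffer instead of `foo.csv` so that it does not depend on a file on disk; that substitution is an assumption made for the example.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, np.nan, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, np.nan, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

buffer = io.StringIO()
df.to_csv(buffer)        # Same text that df.to_csv('foo.csv') would write
buffer.seek(0)           # Rewind so the buffer can be read from the start

restored = pd.read_csv(buffer, index_col=0)   # First column becomes the index
print(restored)
```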


In [94]:
df.to_excel('foo.xlsx', sheet_name='Sheet1') # Write to Microsoft Excel file

In [95]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'], engine='openpyxl') # Read from Microsoft Excel file

Unnamed: 0.1,Unnamed: 0,num_legs,num_wings,num_specimen_seen
0,falcon,2.0,2,10.0
1,dog,4.0,0,
2,spider,,0,1.0
3,fish,0.0,0,8.0


### Plotting

In [96]:
# Install Matplotlib using pip
!pip install matplotlib


In [None]:
from matplotlib import pyplot as plt # Import Matplotlib module

In [None]:
# Generate random time-series data
ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000', periods=1000)) 
ts.head()

In [None]:
ts = ts.cumsum()
ts.plot() # Plot graph
plt.show()

In [None]:
# On a DataFrame, the plot() method is convenient to plot all of the columns with labels
df4 = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df4 = df4.cumsum()
df4.head()

In [1]:
df4.plot()
plt.show()


### Hierarchical indexing

In [97]:
import pandas as pd

In [99]:
import numpy as np

In [105]:
# Build a Series whose index is a list of (state, year) tuples
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [106]:
# Select the 2010 entries by filtering the tuple keys (works, but is clumsy)
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

In [107]:
# Convert the list of tuples into a MultiIndex
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [108]:
# Reindex the Series with the MultiIndex to get a hierarchical view of the data
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [109]:
pop[:, 2010] # Select all entries whose second index level is 2010

California    37253956
New York      19378102
Texas         25145561
dtype: int64

In [111]:
# unstack() turns the inner index level into columns
pop_df = pop.unstack()
pop_df

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [112]:
pop_df.stack() # stack() reverses unstack(), moving the columns back into the index

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [114]:
# Add a second column (population under 18) that shares the same MultiIndex
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

Unnamed: 0,Unnamed: 1,total,under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


In [115]:
# Fraction of the population under 18, by state and year
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


In [116]:
# Passing a list of index arrays to the constructor creates a MultiIndex implicitly
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.548808,0.3183
a,2,0.458721,0.45134
b,1,0.797501,0.253295
b,2,0.515591,0.691823


In [117]:
# A dictionary with tuple keys is also converted to a MultiIndex automatically
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

In [118]:
# Build a MultiIndex from a list of arrays, one array per level
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [119]:
# Build a MultiIndex from a list of tuples, one tuple per entry
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [120]:
# Build a MultiIndex from the Cartesian product of the level values
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [122]:
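# Construct a MultiIndex directly from levels and integer codes
# (the `labels` keyword was renamed to `codes` in newer pandas versions)
pd.MultiIndex(levels=[['a', 'b'], [1, 2]], codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )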

In [123]:
# Name the index levels
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [124]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,39.0,36.2,32.0,36.4,34.0,36.8
2013,2,55.0,36.8,48.0,37.6,36.0,36.0
2014,1,39.0,36.9,46.0,38.5,40.0,36.7
2014,2,31.0,36.1,30.0,35.1,44.0,38.0


In [125]:
health_data['Guido'] # Select all columns for one subject (top column level)

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,32.0,36.4
2013,2,48.0,37.6
2014,1,46.0,38.5
2014,2,30.0,35.1


In [126]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [127]:
pop['California', 2000] # Access a single element using both index levels

np.int64(33871648)

In [128]:
pop['California'] # Partial indexing: give only the outer level

year
2000    33871648
2010    37253956
dtype: int64

In [129]:
pop.loc['California':'New York'] # Slice on the outer level (requires a sorted index)

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

In [130]:
pop[:, 2000] # Partial indexing on the inner level

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

In [131]:
pop[pop > 22000000] # Select with a boolean mask

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [132]:
pop[['California', 'Texas']] # Select several outer-level keys at once

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

In [133]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,39.0,36.2,32.0,36.4,34.0,36.8
2013,2,55.0,36.8,48.0,37.6,36.0,36.0
2014,1,39.0,36.9,46.0,38.5,40.0,36.7
2014,2,31.0,36.1,30.0,35.1,44.0,38.0


In [134]:
health_data['Guido', 'HR'] # Select a single column using both column levels

year  visit
2013  1        32.0
      2        48.0
2014  1        46.0
      2        30.0
Name: (Guido, HR), dtype: float64

In [135]:
health_data.iloc[:2, :2]

Unnamed: 0_level_0,subject,Bob,Bob
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,39.0,36.2
2013,2,55.0,36.8


In [136]:
health_data.loc[:, ('Bob', 'HR')] # In loc, a tuple addresses multiple levels of one axis

year  visit
2013  1        39.0
      2        55.0
2014  1        39.0
      2        31.0
Name: (Bob, HR), dtype: float64

In [137]:
# Slice syntax is not allowed inside a tuple, so this attempt raises a SyntaxError
health_data.loc[(:, 1), (:, 'HR')]

SyntaxError: invalid syntax (3311942670.py, line 1)

In [139]:
# pd.IndexSlice builds the same slices explicitly: rows for visit 1, HR columns only
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,39.0,32.0,34.0
2014,1,39.0,46.0,40.0
