### Essential NumPy and Pandas Functions for Data Analysis

The snippets below demonstrate the most commonly used NumPy and pandas methods for data analysis in Python. Run them in a Jupyter notebook to experiment with each one.

In [1]:
import numpy as np

1. Creating Arrays

In [2]:
arr = np.array([1, 2, 3, 4])            # 1D array
matrix = np.array([[1, 2], [3, 4]])     # 2D array
zeros = np.zeros((2, 3))                # 2x3 array of zeros
ones = np.ones((2, 3))                  # 2x3 array of ones
rnd = np.random.rand(2, 3)              # 2x3 array of random floats
print(rnd)

[[0.83119062 0.70675331 0.75050582]
 [0.85236014 0.34539269 0.74641396]]
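Arrays can also be generated from a range or a spacing rule rather than typed out. The following is a minimal sketch; the variable names `evens`, `grid`, and `identity` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

evens = np.arange(0, 10, 2)        # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced points, both endpoints included
identity = np.eye(3)               # 3x3 identity matrix

print(evens)            # [0 2 4 6 8]
print(grid)             # [0.   0.25 0.5  0.75 1.  ]
print(identity.shape)   # (3, 3)
```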


2. Array Shape, Reshape, Flatten

In [3]:
print(arr.shape)                     # Shape of the 1D array
reshaped = matrix.reshape((4, 1))    # 2x2 matrix reshaped to 4 rows, 1 column
flat = matrix.ravel()                # Flattened to 1D

(4,)
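One detail worth knowing about flattening: `ravel()` returns a view when it can, while `flatten()` always returns a copy. The sketch below illustrates the difference on the same 2x2 matrix; the names `column`, `flat_view`, and `flat_copy` are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

column = matrix.reshape(-1, 1)   # -1 lets NumPy infer the row count (4 here)
flat_view = matrix.ravel()       # a view of the same data for a contiguous array
flat_copy = matrix.flatten()     # always an independent copy

flat_view[0] = 99                # writes through to matrix because it is a view

print(column.shape)   # (4, 1)
print(matrix[0, 0])   # 99
print(flat_copy[0])   # 1
```

If the original array must stay untouched, `flatten()` or an explicit `.copy()` is the safer choice.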


3. Indexing and Slicing

In [4]:
print(matrix[0, 1]) # element at first row, second column
print(arr[1:3]) # elements at indices 1 and 2 (the stop index 3 is excluded)

2
[2 3]
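The same indexing syntax extends to whole rows, whole columns, negative positions, and lists of positions. A short sketch, using illustrative variable names:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])
arr = np.array([10, 20, 30, 40])

first_row = matrix[0]        # whole first row
second_col = matrix[:, 1]    # every row, second column
last = matrix[-1, -1]        # negative indices count from the end
picked = arr[[0, 3]]         # a list of positions selects several elements at once

print(first_row)    # [1 2]
print(second_col)   # [2 4]
print(last)         # 4
print(picked)       # [10 40]
```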


4. Basic Math Operations

In [5]:
a = np.array([10, 20, 30])
b = np.array([1,2,3])
print(a + b)
print(a - b)
print(a * b)
print(a / b)

[11 22 33]
[ 9 18 27]
[10 40 90]
[10. 10. 10.]
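These operators also work when the shapes differ but are compatible, which NumPy calls broadcasting. A minimal sketch, assuming a hypothetical 2x3 array named `table`:

```python
import numpy as np

a = np.array([10, 20, 30])
table = np.array([[1, 2, 3], [4, 5, 6]])

scaled = a * 2         # the scalar is applied to every element
shifted = table + a    # a is added to each row of table

print(scaled)    # [20 40 60]
print(shifted)   # [[11 22 33]
                 #  [14 25 36]]
```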


5. Aggregation and Stats

In [6]:
print(np.sum(arr))
print(np.mean(arr))
print(np.std(arr))
print(np.min(arr))
print(np.max(arr))

10
2.5
1.118033988749895
1
4
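On 2D arrays, the `axis` argument controls whether an aggregation runs down the columns or across the rows. The sketch below also shows `ddof=1`, which gives the sample standard deviation that pandas reports by default; `np.std` alone uses `ddof=0`.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])
arr = np.array([1, 2, 3, 4])

col_sums = matrix.sum(axis=0)     # collapse rows: one sum per column
row_sums = matrix.sum(axis=1)     # collapse columns: one sum per row
col_means = matrix.mean(axis=0)
sample_std = np.std(arr, ddof=1)  # divides by n - 1 instead of n

print(col_sums)     # [4 6]
print(row_sums)     # [3 7]
print(col_means)    # [2. 3.]
print(sample_std)   # roughly 1.291
```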


6. Dot Product, Matrix Ops

In [7]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32
mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
matprod = np.matmul(mat, mat2) # Matrix multiplication
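The cell above prints nothing, so here is a small sketch that shows the results and contrasts matrix multiplication with element-wise multiplication. The `@` operator is equivalent to `np.matmul` for these 2D arrays.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])

dot = np.dot(a, b)         # 1*4 + 2*5 + 3*6
matprod = mat @ mat2       # matrix product, same as np.matmul(mat, mat2)
elementwise = mat * mat2   # multiplies matching positions only

print(dot)           # 32
print(matprod)       # [[19 22]
                     #  [43 50]]
print(elementwise)   # [[ 5 12]
                     #  [21 32]]
```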

7. Logical and Filtering

In [8]:
arr[arr > 2] # Elements greater than 2: array([3, 4])
np.where(arr > 2) # Indices where condition is true

(array([2, 3]),)
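`np.where` also has a three-argument form that picks between two values per element, and boolean masks can be combined with `&` and `|`. A brief sketch; the labels `'high'` and `'low'` are arbitrary choices for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

labels = np.where(arr > 2, 'high', 'low')   # choose a value per element
indices = np.where(arr > 2)[0]              # unpack the tuple to get a plain index array
between = arr[(arr > 1) & (arr < 4)]        # parentheses are required around each condition

print(labels)    # ['low' 'low' 'high' 'high']
print(indices)   # [2 3]
print(between)   # [2 3]
```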

8. Random Sampling

In [9]:
np.random.randint(0, 100, size=5)   # 5 random ints from 0 to 99
np.random.choice([0, 1], size=10)   # 10 random 0s or 1s

array([0, 0, 0, 1, 0, 1, 0, 1, 1, 0])
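For reproducible draws, NumPy also offers the `Generator` interface created with `np.random.default_rng`. The sketch below is one possible way to write the same sampling with it; the seed value `0` is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)        # seeded generator, repeatable across runs

ints = rng.integers(0, 100, size=5)        # 5 random ints from 0 to 99 (upper bound excluded)
flips = rng.choice([0, 1], size=10)        # 10 random 0s or 1s

# A second generator with the same seed reproduces the same first draw
same_ints = np.random.default_rng(seed=0).integers(0, 100, size=5)

print(ints)
print(flips)
print(np.array_equal(ints, same_ints))     # True
```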

### Pandas Basics for Data Analysis

In [10]:
import pandas as pd

1. Creating DataFrame

In [11]:
data = {'A':[1,2,3], 'B':[4,5,6]}
df = pd.DataFrame(data) # Create DataFrame
df.head()

   A  B
0  1  4
1  2  5
2  3  6


2. Inspecting Data

In [12]:
df.head()                 # First 5 rows
df.tail(3)                # Last 3 rows
df.info()                 # DataFrame info + datatypes
df.describe()             # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 180.0 bytes


         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0


3. Selecting Data

In [13]:
df['A']                   # Get column A
df.loc[0]                 # Row by label (index 0)
df.iloc[1:3]              # Rows 1-2 by position
df[df['A'] > 1]           # Filter rows where A > 1

   A  B
1  2  5
2  3  6
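Row filters and column selection can be combined in a single `.loc` call, and several conditions can be joined with `&`. A minimal sketch on the same two-column frame; the names `subset`, `both`, and `value` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

subset = df.loc[df['A'] > 1, ['B']]          # rows where A > 1, column B only
both = df[(df['A'] > 1) & (df['B'] < 6)]     # each condition needs its own parentheses
value = df.at[0, 'B']                        # fast access to a single cell by label

print(subset)
print(both)
print(value)   # 4
```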


4. Adding/Dropping Columns

In [14]:
df['C'] = df['A'] + df['B']     # Add new column
new_df = df.drop('B', axis=1)   # Drop column B
new_df

   A  C
0  1  5
1  2  7
2  3  9


5. Missing Values & Types

In [15]:
# df.isnull().sum()               # Missing values per column
# df.fillna(0)                    # Replace NaNs with 0
df['A'] = df['A'].astype(float) # Change column type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       3 non-null      float64
 1   B       3 non-null      int64  
 2   C       3 non-null      int64  
dtypes: float64(1), int64(2)
memory usage: 204.0 bytes
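The frame used so far has no missing values, so `isnull()` and `fillna()` have nothing to act on. The sketch below uses a small hypothetical frame with one `NaN` to show their effect.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column A
df_missing = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4, 5, 6]})

missing_counts = df_missing.isnull().sum()                      # missing values per column
filled = df_missing.fillna({'A': df_missing['A'].mean()})       # fill A with its mean (NaN is skipped)
dropped = df_missing.dropna()                                   # or drop rows containing any NaN

print(missing_counts)
print(filled['A'].tolist())   # [1.0, 2.0, 3.0]
print(len(dropped))           # 2
```

Both `fillna` and `dropna` return new objects here; `df_missing` itself is unchanged.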


6. Grouping and Aggregation

In [16]:
grouped = df.groupby('A').sum() # Sum columns, grouped by values in A
df['A'].value_counts()

A
1.0    1
2.0    1
3.0    1
Name: count, dtype: int64
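Because every value in `A` is unique, each group above holds a single row. Grouping becomes more informative when keys repeat, as in this sketch with a hypothetical `team` and `score` frame.

```python
import pandas as pd

# Hypothetical data with a repeated grouping key
scores = pd.DataFrame({'team': ['x', 'x', 'y'], 'score': [1, 2, 3]})

totals = scores.groupby('team')['score'].sum()                      # one sum per team
summary = scores.groupby('team')['score'].agg(['mean', 'count'])    # several statistics at once

print(totals)
print(summary)
```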


7. Sorting

In [17]:
df.sort_values('B', ascending=False)  # Sort by column B

     A  B  C
2  3.0  6  9
1  2.0  5  7
0  1.0  4  5
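`sort_values` also accepts a list of columns, with a separate direction for each. A short sketch on a hypothetical frame with `dept` and `pay` columns:

```python
import pandas as pd

# Hypothetical data for a two-key sort
staff = pd.DataFrame({'dept': ['b', 'a', 'b'], 'pay': [1, 3, 2]})

# Sort by dept ascending, then by pay descending within each dept
ordered = staff.sort_values(['dept', 'pay'], ascending=[True, False])
ordered = ordered.reset_index(drop=True)   # renumber rows after sorting

print(ordered)
```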


8. Reading/Writing Files

In [18]:
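df.to_csv('mydata.csv') # Save to CSV (the index is written too; pass index=False to omit it)
df = pd.read_csv('mydata.csv') # Load from CSV; the saved index comes back as an 'Unnamed: 0' column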
df.head()

   Unnamed: 0    A  B  C
0           0  1.0  4  5
1           1  2.0  5  7
2           2  3.0  6  9
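The `Unnamed: 0` column in this output is the old index, written by `to_csv` and read back as data. One way to avoid it is `index=False`. The sketch below uses an in-memory `io.StringIO` buffer instead of a file so it leaves nothing on disk; that substitution is an assumption made for the example.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

buffer = io.StringIO()                 # stands in for a file on disk
df_small.to_csv(buffer, index=False)   # do not write the index column
buffer.seek(0)                         # rewind before reading
restored = pd.read_csv(buffer)

print(list(restored.columns))   # ['A', 'B']
```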


9. Apply/Map Functions

In [19]:
df['A_squared'] = df['A'].apply(lambda x: x**2) # Apply to column
df['B_label'] = df['B'].map({4: 'Low', 5: 'Med', 6: 'High'}) # Map values
df.sample(3)

   Unnamed: 0    A  B  C  A_squared B_label
1           1  2.0  5  7        4.0     Med
0           0  1.0  4  5        1.0     Low
2           2  3.0  6  9        9.0    High
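Two points are worth a quick sketch here: simple arithmetic such as squaring can be written directly on the column without `apply`, and `map` yields `NaN` for any value missing from the dictionary. The value `7` and the `'Unknown'` label below are hypothetical, chosen to show that case.

```python
import pandas as pd

# Hypothetical frame: 7 has no entry in the mapping below
demo = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4, 5, 7]})

demo['A_squared'] = demo['A'] ** 2                       # vectorized, no lambda needed
labels = demo['B'].map({4: 'Low', 5: 'Med', 6: 'High'})  # unmapped values become NaN
demo['B_label'] = labels.fillna('Unknown')               # give unmapped values a default

print(demo)
```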


### Task:

- Generate a synthetic dataset with the columns `age`, `salary`, `department`, `years_experience`, and `is_manager`.

In [20]:
np.random.seed(42) # For reproducibility
n_samples = 100 # Number of samples

data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),
    'is_manager': np.random.choice([0, 1], size=n_samples)
}
df = pd.DataFrame(data)

Q1. View data structure

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               100 non-null    int64  
 1   salary            100 non-null    int64  
 2   department        100 non-null    object 
 3   years_experience  100 non-null    float64
 4   is_manager        100 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 4.0+ KB


Q2. Get DataFrame Info and Summary Stats

In [22]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               100 non-null    int64  
 1   salary            100 non-null    int64  
 2   department        100 non-null    object 
 3   years_experience  100 non-null    float64
 4   is_manager        100 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 4.0+ KB


             age        salary  years_experience  is_manager
count      100.0         100.0             100.0       100.0
mean       37.91      77809.16             4.823        0.47
std    12.219454  26058.643576          2.237822    0.501614
min         18.0       30206.0              -0.8         0.0
25%        26.75       55141.0             3.475         0.0
50%         38.0       80932.0               4.7         0.0
75%        46.25      98107.25               6.0         1.0
max         59.0      119474.0              11.4         1.0


Q3. Do Simple NumPy Operations

In [23]:
salaries = df['salary'].to_numpy()   # Convert the column to a NumPy array
print(np.mean(salaries))
print(np.median(salaries))
print(np.min(salaries))
print(np.max(salaries))

77809.16
80932.0
30206
119474


Q4. Filtering and Indexing Rows

In [24]:
print('Rows with salary > 80000:')
print(df[df['salary'] > 80000].head())
print('\nRow at index 5 using .loc:')
print(df.loc[5])
print('\nRows from index 10 to 12 using .iloc:')
print(df.iloc[10:13])

Rows with salary > 80000:
   age  salary department  years_experience  is_manager
2   32  108603         HR               5.0           1
3   25   82256         HR               4.2           1
4   38  119135         HR               4.1           1
6   36  107373    Finance               6.7           1
7   40  109575  Marketing               6.9           0

Row at index 5 using .loc:
age                      56
salary                65222
department          Finance
years_experience        2.8
is_manager                0
Name: 5, dtype: object

Rows from index 10 to 12 using .iloc:
    age  salary department  years_experience  is_manager
10   41   40965         IT               9.5           1
11   53   54538  Marketing               5.3           0
12   57  100592    Finance               6.1           1


Q5. Adding a Column

In [25]:
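# The +1 avoids division by zero when years_experience is exactly 0.
# It does not protect against negative values: -1 still divides by zero,
# and values such as -0.8 produce an inflated result.
df['salary_per_year'] = df['salary'] / (df['years_experience'] + 1)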
print(df.head())

   age  salary department  years_experience  is_manager  salary_per_year
0   56   38392         IT              -0.8           0    191960.000000
1   46   60535  Marketing               3.4           1     13757.954545
2   32  108603         HR               5.0           1     18100.500000
3   25   82256         HR               4.2           1     15818.461538
4   38  119135         HR               4.1           1     23359.803922
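Row 0 shows the effect of a negative `years_experience`: dividing by `-0.8 + 1 = 0.2` inflates the result to 191960. One possible remedy, sketched below on the first two rows of the output above, is to clip negative experience to zero before dividing. Whether clipping is the right treatment for this dataset is an assumption; regenerating the column without negatives would be another option.

```python
import pandas as pd

# First two rows of the output above, re-entered by hand
sample = pd.DataFrame({
    'salary': [38392, 60535],
    'years_experience': [-0.8, 3.4],
})

years = sample['years_experience'].clip(lower=0)         # negative values become 0
sample['salary_per_year'] = sample['salary'] / (years + 1)

print(sample)
```

With clipping, row 0 becomes 38392.0 while row 1 is unchanged at about 13757.95.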


Q6. Grouping and Aggregation

In [26]:
avg_salary_per_department = df.groupby('department')['salary'].mean()
print('Average Salary per Department:')
print(avg_salary_per_department)

# Also show value counts for 'is_manager' for another aggregation example
print('\nManager counts:')
print(df['is_manager'].value_counts())

Average Salary per Department:
department
Finance      83124.708333
HR           73523.052632
IT           75825.476190
Marketing    77684.722222
Name: salary, dtype: float64

Manager counts:
is_manager
0    53
1    47
Name: count, dtype: int64
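Several aggregations can be computed in one pass with named aggregation. The sketch below uses a small hand-made frame rather than the synthetic dataset, so its numbers are illustrative only; the output column names `avg_salary`, `managers`, and `headcount` are hypothetical.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the task
small = pd.DataFrame({
    'department': ['IT', 'IT', 'HR'],
    'salary': [50000, 70000, 60000],
    'is_manager': [0, 1, 1],
})

summary = small.groupby('department').agg(
    avg_salary=('salary', 'mean'),     # mean salary per department
    managers=('is_manager', 'sum'),    # number of managers (sum of 0/1 flags)
    headcount=('salary', 'count'),     # rows per department
)

print(summary)
```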
