### Essential NumPy and Pandas Functions for Data Analysis

The snippets below demonstrate the NumPy and pandas functions used most often for data analysis in Python. Run them in a Jupyter notebook to experiment with each one.

#### NumPy Basics for Data Analysis
Import NumPy:

In [8]:
import numpy as np

1. Creating Arrays

In [None]:
arr = np.array([1, 2, 3, 4])            # 1D array
matrix = np.array([[1, 2], [3, 4]])     # 2D array
zeros = np.zeros((2, 3))                # 2x3 array of zeros
ones = np.ones((2, 3))                  # 2x3 array of ones
rnd = np.random.rand(2, 3)              # 2x3 array of random floats

2. Array Shape, Reshape, Flatten

In [None]:
print(arr.shape)                # (4,)
reshaped = matrix.reshape((4, 1))   # Reshape to 4 rows, 1 column
flat = matrix.ravel()               # Flatten to 1D array

(4,)
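
A related detail that is easy to miss: `ravel()` returns a view of the original data when it can, while `flatten()` always returns a copy. The sketch below illustrates the difference on the same 2x2 `matrix`; the names `column`, `flat_copy`, and `flat_view` are introduced here only for illustration.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

# -1 asks NumPy to infer that dimension from the array size
column = matrix.reshape(-1, 1)      # shape (4, 1)

# flatten() always copies; ravel() returns a view when the memory layout allows it
flat_copy = matrix.flatten()
flat_view = matrix.ravel()

flat_copy[0] = 99                   # leaves matrix untouched
flat_view[1] = 77                   # writes through to matrix[0, 1]

print(column.shape)
print(matrix)
```

If you need to modify the flattened result without touching the source array, `flatten()` is the safer choice.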


3. Indexing and Slicing

In [None]:
print(matrix[0, 1])    # element at first row, second column
print(arr[1:3])        # elements at indices 1 and 2 (the stop index is excluded)

2
[2 3]


4. Basic Math Operations

In [None]:
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])
print(a + b)           # [11 22 33]
print(a - b)           # [ 9 18 27]
print(a * b)           # [10 40 90]
print(a / b)           # [10. 10. 10.]


[11 22 33]
[ 9 18 27]
[10 40 90]
[10. 10. 10.]


5. Aggregation and Stats

In [None]:
np.sum(arr)                 # 10
np.mean(arr)                # 2.5
np.std(arr)                 # Standard deviation
np.min(arr), np.max(arr)    # Min and max values

(1, 4)
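
On 2D arrays the same aggregations accept an `axis` argument. A minimal sketch, reusing the 2x2 `matrix` from earlier (the result names are illustrative):

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

total = matrix.sum()             # all elements: 10
col_sums = matrix.sum(axis=0)    # collapse rows, one sum per column: [4 6]
row_means = matrix.mean(axis=1)  # collapse columns, one mean per row: [1.5 3.5]

print(total, col_sums, row_means)
```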

6. Dot Product, Matrix Ops

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32
mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
matprod = np.matmul(mat, mat2) # Matrix multiplication

7. Logical and Filtering

In [None]:
arr[arr > 2]                # Returns [3, 4]
np.where(arr > 2)           # Indices where condition is true

(array([2, 3], dtype=int64),)
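
`np.where` also has a three-argument form that picks between two values element by element. The following sketch shows both forms side by side; the labels `"high"` and `"low"` are arbitrary examples.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

# One argument: a tuple of index arrays, one per dimension
idx = np.where(arr > 2)[0]                  # indices 2 and 3

# Three arguments: choose per element between two alternatives
labels = np.where(arr > 2, "high", "low")
capped = np.where(arr > 2, 2, arr)          # replace values above 2 with 2

print(idx, labels, capped)
```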

8. Random Sampling

In [None]:
np.random.randint(0, 100, size=5) # 5 random ints 0-99
np.random.choice([0, 1], size=10) # 10 random 0s or 1s

array([0, 0, 1, 0, 1, 0, 0, 1, 1, 1])
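
The `np.random.*` functions above draw from a global random state. NumPy's documentation recommends the newer `Generator` interface for new code; a sketch of the equivalent calls, assuming a seed of 42 chosen only for illustration:

```python
import numpy as np

# A Generator carries its own state, so seeding it does not affect other code
rng = np.random.default_rng(seed=42)

ints = rng.integers(0, 100, size=5)   # 5 random ints in [0, 100)
flips = rng.choice([0, 1], size=10)   # 10 random 0s or 1s

# A second generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(seed=42)
ints_again = rng_again.integers(0, 100, size=5)

print(ints, flips)
print(np.array_equal(ints, ints_again))
```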

#### Pandas Basics for Data Analysis
Import pandas:

In [None]:
import pandas as pd

1. Creating DataFrames

In [None]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data) # Create DataFrame

2. Inspecting Data

In [None]:
df.head()                # First 5 rows
df.tail(3)                # Last 3 rows
df.info()                 # DataFrame info + datatypes
df.describe()             # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 180.0 bytes


         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0


3. Selecting Data

In [None]:
df['A']                   # Get column A
df.loc[0]                 # Row by label (index 0)
df.iloc[1:3]              # Rows 1-2 by position
df[df['A'] > 1]           # Filter rows where A > 1

   A  B
1  2  5
2  3  6


4. Adding/Dropping Columns

In [None]:
df['C'] = df['A'] + df['B']     # Add new column
new_df = df.drop('B', axis=1)   # Drop column B

5. Missing Values & Types

In [None]:
df.isnull().sum()               # Missing values per column
df.fillna(0)                    # Return a copy with NaNs replaced by 0 (df itself is unchanged)
df['A'] = df['A'].astype(float) # Change column type
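
The example `df` has no missing values, so the calls above have no visible effect on it. The sketch below uses a small hypothetical frame, `demo`, that does contain NaNs, to show what each call returns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column
demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

missing_per_col = demo.isnull().sum()     # one missing value in each column
filled_zero = demo.fillna(0)              # new DataFrame; demo is unchanged
filled_mean = demo.fillna(demo.mean())    # fill each column with its own mean
complete_rows = demo.dropna()             # keep only rows without any NaN

print(missing_per_col)
print(filled_mean)
```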

6. Grouping and Aggregation

In [None]:
grouped = df.groupby('A').sum()   # Sum columns, grouped by values in A
df['A'].value_counts()            # Count unique values in A

A
1.0    1
2.0    1
3.0    1
Name: count, dtype: int64
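
Because every value in `A` is unique here, each group holds a single row. Grouping is easier to see with repeated keys; the sketch below uses a hypothetical `demo` frame with made-up `team` and `points` columns.

```python
import pandas as pd

# Hypothetical data with repeated group keys
demo = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "points": [1, 2, 3, 4, 5],
})

totals = demo.groupby("team")["points"].sum()   # one sum per team
counts = demo["team"].value_counts()            # rows per team, most frequent first

print(totals)
print(counts)
```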

7. Sorting

In [None]:
df.sort_values('B', ascending=False)  # Sort by column B

     A  B  C
2  3.0  6  9
1  2.0  5  7
0  1.0  4  5


8. Reading/Writing Files

In [None]:
df.to_csv('mydata.csv', index=False) # Save to CSV without writing the index as a column
df = pd.read_csv('mydata.csv') # Load from CSV
df.head()

     A  B  C
0  1.0  4  5
1  2.0  5  7
2  3.0  6  9


9. Apply/Map Functions

In [None]:
df['A_squared'] = df['A'].apply(lambda x: x**2) # Apply to column
df['B_label'] = df['B'].map({4: 'Low', 5: 'Med', 6: 'High'}) # Map values
df.sample(3)
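
For simple arithmetic, a vectorized expression gives the same result as `apply` and is usually faster. Note also that `map` yields NaN for any value missing from the mapping. A short sketch on a hypothetical `demo` frame:

```python
import pandas as pd

demo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

squared_apply = demo["A"].apply(lambda x: x**2)   # calls the lambda once per element
squared_vec = demo["A"] ** 2                      # vectorized, same values

# 6 is deliberately left out of the mapping, so it becomes NaN
labels = demo["B"].map({4: "Low", 5: "Med"})

print(squared_vec.tolist())
print(labels)
```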


### Task:

- Generate a synthetic dataset with the columns `age`, `salary`, `department`, `years_experience`, and `is_manager`.

In [10]:
pip install numpy pandas


In [11]:
import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(42) 

# Define the number of samples
n_samples = 1000 

# Generate the synthetic data
data = {
    # Age: Uniform distribution of integers between 18 and 59
    'age': np.random.randint(18, 60, size=n_samples),
    
    # Salary: Uniform distribution of integers from 30,000 to 119,999 (upper bound is exclusive)
    'salary': np.random.randint(30000, 120000, size=n_samples),
    
    # Department: Randomly chosen categorical values
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    
    # Years of Experience: Normal distribution, rounded to one decimal place
    # Note: Values are clipped at 0 since experience cannot be negative
    'years_experience': np.round(np.clip(np.random.normal(5, 2.5, size=n_samples), a_min=0, a_max=None), 1),
    
    # Is Manager: Binary choice (0 or 1)
    'is_manager': np.random.choice([0, 1], p=[0.8, 0.2], size=n_samples) # 20% managers
}

# Create the Pandas DataFrame
df = pd.DataFrame(data)

# Display the first 5 rows and the data information
print("--- First 5 Rows of the DataFrame ---")
print(df.head())
print("\n--- DataFrame Information ---")
df.info()  # info() prints its report directly and returns None, so no print() is needed

--- First 5 Rows of the DataFrame ---
   age  salary department  years_experience  is_manager
0   56   44382         HR               5.3           0
1   46  114291  Marketing               2.2           0
2   32   33756    Finance               0.6           1
3   25   50609    Finance               2.4           1
4   38   46478         IT               5.6           0

--- DataFrame Information ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               1000 non-null   int32  
 1   salary            1000 non-null   int32  
 2   department        1000 non-null   object 
 3   years_experience  1000 non-null   float64
 4   is_manager        1000 non-null   int64  
dtypes: float64(1), int32(2), int64(1), object(1)
memory usage: 31.4+ KB


Q1. View data structure

In [12]:
import pandas as pd
# Assuming the DataFrame 'df' has been created from the previous code block

# 1. View the first few rows (the data itself)
print("--- DataFrame Head (First 5 Rows) ---")
print(df.head())

# 2. View the structural summary (Data Types and Non-Null Counts)
print("\n--- DataFrame Structure (df.info()) ---")
df.info()  # prints directly; wrapping it in print() would also print "None"

# 3. View descriptive statistics for numerical columns
print("\n--- Descriptive Statistics (df.describe()) ---")
print(df.describe())

--- DataFrame Head (First 5 Rows) ---
   age  salary department  years_experience  is_manager
0   56   44382         HR               5.3           0
1   46  114291  Marketing               2.2           0
2   32   33756    Finance               0.6           1
3   25   50609    Finance               2.4           1
4   38   46478         IT               5.6           0

--- DataFrame Structure (df.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               1000 non-null   int32  
 1   salary            1000 non-null   int32  
 2   department        1000 non-null   object 
 3   years_experience  1000 non-null   float64
 4   is_manager        1000 non-null   int64  
dtypes: float64(1), int32(2), int64(1), object(1)
memory usage: 31.4+ KB

--- Descriptive Statistics (df.describe()) ---
               age       

Q2. Get DataFrame Info and Summary Stats

In [13]:
# 1. Get a concise summary of the DataFrame, including data types and non-null values
print("--- DataFrame Information (df.info()) ---")
df.info()

# 2. Get descriptive statistics for all numerical columns
print("\n--- Summary Statistics (df.describe()) ---")
print(df.describe())

# 3. Get descriptive statistics for all columns (including categorical/object types)
print("\n--- Summary Statistics for ALL Columns (df.describe(include='all')) ---")
print(df.describe(include='all'))

--- DataFrame Information (df.info()) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               1000 non-null   int32  
 1   salary            1000 non-null   int32  
 2   department        1000 non-null   object 
 3   years_experience  1000 non-null   float64
 4   is_manager        1000 non-null   int64  
dtypes: float64(1), int32(2), int64(1), object(1)
memory usage: 31.4+ KB

--- Summary Statistics (df.describe()) ---
               age         salary  years_experience   is_manager
count  1000.000000    1000.000000       1000.000000  1000.000000
mean     38.745000   75014.781000          4.959000     0.197000
std      12.186734   25872.805529          2.426423     0.397931
min      18.000000   30060.000000          0.000000     0.000000
25%      28.000000   54084.250000          3.275000     0.000000
50%      40.00000

Q3. Do Simple NumPy Operations

In [14]:
# 1. Calculate the Mean Salary (Aggregation)
mean_salary = df['salary'].mean()

# 2. Find the Maximum Age (Aggregation)
max_age = df['age'].max()

# 3. Calculate the Standard Deviation of Years of Experience (Aggregation)
std_experience = df['years_experience'].std()

# 4. Element-wise operation: Salary in Thousands (using NumPy array division)
# Accesses the underlying NumPy array for fast, element-wise division.
salary_in_thousands = df['salary'].to_numpy() / 1000
first_5_salaries_k = salary_in_thousands[:5]

# 5. Boolean/Logical Operation: Count employees over 40 
# The condition (df['age'] > 40) creates a boolean array (True/False). 
# .sum() treats True as 1 and False as 0, effectively counting True values.
employees_over_40 = (df['age'] > 40).sum()

# ... (printing results)
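
One caveat when mixing the two libraries: pandas' `.std()` computes the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`). The sketch below uses a small hypothetical Series in place of the `years_experience` column to show the difference.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for a numeric column
s = pd.Series([2.0, 4.0, 4.0, 6.0])
values = s.to_numpy()

pandas_std = s.std()                         # sample std (ddof=1)
numpy_std = np.std(values)                   # population std (ddof=0)
numpy_sample_std = np.std(values, ddof=1)    # matches the pandas result

print(pandas_std, numpy_std, numpy_sample_std)
```

On 1000 rows the two values are close, but they are not interchangeable when results need to match exactly.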

Q4. Filtering and Indexing Rows

In [15]:
# Create a boolean mask where years_experience > 7
high_experience_mask = df['years_experience'] > 7.0 

# Apply the mask to the DataFrame
high_experience_employees = df[high_experience_mask]

print("--- Employees with > 7 Years of Experience (First 5) ---")
print(high_experience_employees.head())
print(f"\nTotal employees: {len(high_experience_employees)}")

--- Employees with > 7 Years of Experience (First 5) ---
    age  salary department  years_experience  is_manager
10   41  106325    Finance               7.2           0
15   39   32443  Marketing              11.0           0
24   39   69915         HR               7.5           0
25   42   67219         HR              11.7           0
29   33   83028         IT               9.5           0

Total employees: 203
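
Masks can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. A minimal sketch on a hypothetical four-row `demo` frame that borrows the column names of the synthetic dataset:

```python
import pandas as pd

# Hypothetical rows using the same column names as the synthetic dataset
demo = pd.DataFrame({
    "age": [25, 45, 52, 38],
    "years_experience": [2.0, 8.5, 9.1, 7.5],
    "is_manager": [0, 1, 0, 0],
})

# Parentheses are required because & binds tighter than > and ==
senior_non_managers = demo[(demo["years_experience"] > 7) & (demo["is_manager"] == 0)]

# .loc filters rows and selects a column in one step
senior_ages = demo.loc[demo["years_experience"] > 7, "age"]

# query() expresses the same filter as a string
same_rows = demo.query("years_experience > 7 and is_manager == 0")

print(senior_non_managers)
```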


Q5. Adding a Column

In [16]:
# Create the new column 'salary_k'
df['salary_k'] = df['salary'] / 1000

print("--- DataFrame with 'salary_k' (First 5 Rows) ---")
print(df[['salary', 'salary_k']].head())

--- DataFrame with 'salary_k' (First 5 Rows) ---
   salary  salary_k
0   44382    44.382
1  114291   114.291
2   33756    33.756
3   50609    50.609
4   46478    46.478
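
New columns can also be derived from a condition. The sketch below adds a hypothetical `band` column with `np.where`; the 50,000 threshold is an arbitrary value chosen for illustration, and `demo` holds the first three salaries shown above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"salary": [44382, 114291, 33756]})

# assign() returns a new DataFrame instead of modifying demo in place
demo = demo.assign(salary_k=demo["salary"] / 1000)

# Hypothetical threshold: label salaries of 50,000 or more as "high"
demo["band"] = np.where(demo["salary"] >= 50000, "high", "low")

print(demo)
```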


Q6. Grouping and Aggregation

In [17]:
# Group by 'department' and calculate the mean of all numerical columns
average_salary_by_dept = df.groupby('department').mean(numeric_only=True)

print("--- Average Values Grouped by Department ---")
print(average_salary_by_dept[['salary', 'years_experience']])

--- Average Values Grouped by Department ---
                  salary  years_experience
department                                
Finance     78946.594714          5.034361
HR          74355.431907          4.934241
IT          74971.417671          4.999598
Marketing   72347.097378          4.880899
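
When each column needs a different statistic, `agg` with named aggregation avoids computing every mean and then selecting columns. A sketch on a hypothetical four-row `demo` frame; the output column names `mean_salary`, `max_experience`, and `headcount` are made up for this example.

```python
import pandas as pd

# Hypothetical rows using the same column names as the synthetic dataset
demo = pd.DataFrame({
    "department": ["IT", "IT", "HR", "HR"],
    "salary": [50000, 70000, 40000, 60000],
    "years_experience": [2.0, 6.0, 3.0, 5.0],
})

# Each keyword is an output column: (source column, aggregation)
summary = demo.groupby("department").agg(
    mean_salary=("salary", "mean"),
    max_experience=("years_experience", "max"),
    headcount=("salary", "size"),
)

print(summary)
```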
