### Essential NumPy and Pandas Functions for Data Analysis

Let's look at practical code snippets that demonstrate the most useful NumPy and pandas methods for data analysis in Python. You can run them in a Jupyter notebook to experiment and build up your own workflow.

#### NumPy Basics for Data Analysis
Import NumPy (pandas is imported here as well, for the later sections):

In [1]:
import pandas as pd

In [2]:
import numpy as np

1. Creating Arrays

In [3]:
arr = np.array([1, 2, 3, 4])            # 1D array
matrix = np.array([[1, 2], [3, 4]])     # 2D array
zeros = np.zeros((2, 3))                # 2x3 array of zeros
ones = np.ones((2, 3))                  # 2x3 array of ones
rnd = np.random.rand(2, 3)              # 2x3 array of random floats

2. Array Shape, Reshape, Flatten

In [4]:
print(arr.shape)                # (4,)
reshaped = matrix.reshape((4, 1))   # Reshape to 4 rows, 1 column
flat = matrix.ravel()               # Flatten to 1D array

(4,)
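As a small supplement to the reshape and flatten calls above, the sketch below shows two details that often matter in practice: `-1` lets NumPy infer a dimension, and `ravel()` returns a view where it can while `flatten()` always copies. The variable names `column`, `view`, and `copy` are illustrative only.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

# -1 asks NumPy to infer that dimension from the array's size
column = matrix.reshape(-1, 1)      # shape (4, 1)

# ravel() returns a view when possible; flatten() always returns a copy
view = matrix.ravel()
copy = matrix.flatten()

view[0] = 99    # writes through to matrix[0, 0]
copy[1] = -1    # matrix is unaffected

print(column.shape)   # (4, 1)
print(matrix[0])      # [99  2]
```

If later code must not modify the original array, `flatten()` (or an explicit `.copy()`) is the safer choice.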


3. Indexing and Slicing

In [5]:
print(matrix[0, 1])    # element at first row, second column
print(arr[1:3])        # slice elements 1 to 2

2
[2 3]


4. Basic Math Operations

In [6]:
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])
print(a + b)           # [11 22 33]
print(a - b)           # [ 9 18 27]
print(a * b)           # [10 40 90]
print(a / b)           # [10. 10. 10.]


[11 22 33]
[ 9 18 27]
[10 40 90]
[10. 10. 10.]
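The element-wise operations above also work when the shapes differ, as long as they are compatible under NumPy's broadcasting rules. A minimal sketch, using made-up arrays:

```python
import numpy as np

a = np.array([10, 20, 30])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

scaled = a * 2          # the scalar is applied to every element
shifted = matrix + a    # a (shape (3,)) is added to each row of matrix (shape (2, 3))

print(scaled)           # [20 40 60]
print(shifted)          # [[11 22 33]
                        #  [14 25 36]]
```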


5. Aggregation and Stats

In [7]:
np.sum(arr)                 # 10
np.mean(arr)                # 2.5
np.std(arr)                 # Standard deviation
np.min(arr), np.max(arr)    # Min and max values

(np.int64(1), np.int64(4))
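On 2D arrays, the same aggregation functions accept an `axis` argument. The sketch below reuses the 2x2 `matrix` from earlier; `col_sums` and `row_means` are illustrative names.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

col_sums = matrix.sum(axis=0)     # collapse rows: one sum per column
row_means = matrix.mean(axis=1)   # collapse columns: one mean per row

print(col_sums)    # [4 6]
print(row_means)   # [1.5 3.5]
```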

6. Dot Product, Matrix Ops

In [8]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32
mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
matprod = np.matmul(mat, mat2) # Matrix multiplication
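The cell above produces no output, so here is a short sketch that prints the results and contrasts matrix multiplication with element-wise multiplication. The `@` operator is equivalent to `np.matmul` for these 2D arrays.

```python
import numpy as np

mat = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])

matprod = mat @ mat2       # matrix product, same as np.matmul(mat, mat2)
elementwise = mat * mat2   # multiplies matching positions only

print(matprod)       # [[19 22]
                     #  [43 50]]
print(elementwise)   # [[ 5 12]
                     #  [21 32]]
```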

7. Logical and Filtering

In [9]:
arr[arr > 2]                # Returns [3, 4]
np.where(arr > 2)           # Indices where condition is true

(array([2, 3]),)
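`np.where` also has a three-argument form that picks values element by element instead of returning indices. A minimal sketch, assuming the same `arr` as above:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

# Keep values greater than 2, replace everything else with 0
clipped = np.where(arr > 2, arr, 0)

# Build a label array from the same condition
labels = np.where(arr > 2, 'high', 'low')

print(clipped)   # [0 0 3 4]
print(labels)    # ['low' 'low' 'high' 'high']
```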

8. Random Sampling

In [10]:
np.random.randint(0, 100, size=5) # 5 random ints 0-99
np.random.choice([0, 1], size=10) # 10 random 0s or 1s

array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0])
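The `np.random.*` functions above use NumPy's legacy global state. For new code, the NumPy documentation recommends the `Generator` API created with `np.random.default_rng`. The sketch below shows rough equivalents of the calls above; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)           # seeded generator for reproducibility

ints = rng.integers(0, 100, size=5)       # 5 random ints in [0, 100)
flips = rng.choice([0, 1], size=10)       # 10 random 0s or 1s
floats = rng.random((2, 3))               # 2x3 array of floats in [0, 1)

print(ints.shape, flips.shape, floats.shape)
```

Passing the `rng` object into functions, rather than reseeding a global state, keeps random draws reproducible and isolated.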

#### Pandas Basics for Data Analysis
Import pandas:

In [11]:
import pandas as pd

1. Creating DataFrames

In [12]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data) # Create DataFrame

2. Inspecting Data

In [13]:
df.head()                # First 5 rows
df.tail(3)                # Last 3 rows
df.info()                 # DataFrame info + datatypes
df.describe()             # Summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 180.0 bytes


         A    B
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0


3. Selecting Data

In [14]:
df['A']                   # Get column A
df.loc[0]                 # Row by label (index 0)
df.iloc[1:3]              # Rows 1-2 by position
df[df['A'] > 1]           # Filter rows where A > 1

   A  B
1  2  5
2  3  6
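A few selection details are easy to trip over: `.loc` can filter rows and pick columns in one step, and label slices with `.loc` include the end label while positional slices with `.iloc` do not. A minimal sketch on the same small frame (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

subset = df.loc[df['A'] > 1, ['B']]   # rows where A > 1, column B only
value = df.at[0, 'B']                 # fast scalar access by label

loc_slice = df.loc[0:1]               # labels 0 and 1 (end label included)
iloc_slice = df.iloc[0:1]             # position 0 only (end excluded)

print(subset)
print(value)                          # 4
print(len(loc_slice), len(iloc_slice))  # 2 1
```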


4. Adding/Dropping Columns

In [15]:
df['C'] = df['A'] + df['B']     # Add new column
new_df = df.drop('B', axis=1)   # Drop column B

5. Missing Values & Types

In [16]:
df.isnull().sum()               # Missing values per column
df.fillna(0)                    # Return a copy with NaNs replaced by 0 (df itself is unchanged)
df['A'] = df['A'].astype(float) # Change column type
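The example frame has no missing values, so the calls above have nothing to act on. The sketch below uses a small hypothetical frame, `df_nan`, that actually contains NaNs, to show what each call returns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
df_nan = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                       'B': [4.0, 5.0, np.nan]})

missing = df_nan.isnull().sum()                # count of NaNs per column
filled = df_nan.fillna(0)                      # NaNs replaced by a constant
mean_filled = df_nan.fillna(df_nan.mean())     # NaNs replaced by each column's mean
dropped = df_nan.dropna()                      # rows containing any NaN removed

print(missing)
print(mean_filled)
print(dropped)
```

Each of these returns a new object; assign the result back (for example `df_nan = df_nan.fillna(0)`) to keep it.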

6. Grouping and Aggregation

In [17]:
grouped = df.groupby('A').sum()   # Sum columns, grouped by values in A
df['A'].value_counts()            # Count occurrences of each unique value in A

     count
A
1.0      1
2.0      1
3.0      1


7. Sorting

In [18]:
df.sort_values('B', ascending=False)  # Sort by column B

     A  B  C
2  3.0  6  9
1  2.0  5  7
0  1.0  4  5


8. Reading/Writing Files

In [19]:
df.to_csv('mydata.csv') # Save to CSV (the index is written too; pass index=False to omit it)
df = pd.read_csv('mydata.csv') # Load from CSV
df.head()

   Unnamed: 0    A  B  C
0           0  1.0  4  5
1           1  2.0  5  7
2           2  3.0  6  9
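The reloaded frame above carries an extra `Unnamed: 0` column because `to_csv` writes the index by default and `read_csv` then treats it as ordinary data. The sketch below shows two ways to avoid this. It writes to an in-memory `io.StringIO` buffer instead of a file so that it is self-contained; with a real path the calls are the same.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Option 1: do not write the index at all
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Option 2: write the index, then tell read_csv which column holds it
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index, index_col=0)

print(restored.columns.tolist())             # ['A', 'B']
print(restored_with_index.columns.tolist())  # ['A', 'B']
```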


9. Apply/Map Functions

In [20]:
df['A_squared'] = df['A'].apply(lambda x: x**2) # Apply to column
df['B_label'] = df['B'].map({4: 'Low', 5: 'Med', 6: 'High'}) # Map values
df.sample(3)

   Unnamed: 0    A  B  C  A_squared B_label
1           1  2.0  5  7        4.0     Med
2           2  3.0  6  9        9.0    High
0           0  1.0  4  5        1.0     Low
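For simple arithmetic, a vectorized expression usually does the same job as `apply` with less overhead. Note also that `map` with a dictionary yields NaN for any value missing from the dictionary. A minimal sketch, where the incomplete mapping and the `'Unknown'` fallback label are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4, 5, 6]})

# Vectorized equivalent of df['A'].apply(lambda x: x**2)
df['A_squared'] = df['A'] ** 2

# The mapping deliberately omits 6, so that row becomes NaN ...
df['B_label'] = df['B'].map({4: 'Low', 5: 'Med'})
# ... which can then be filled with a fallback label
df['B_label'] = df['B_label'].fillna('Unknown')

print(df)
```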


### Task:

- Generate a synthetic dataset with the columns `age`, `salary`, `department`, `years_experience`, and `is_manager`.

In [None]:
np.random.seed(42) # For reproducibility
n_samples = 100 # Number of samples

data = {
    'age': np.random.randint(18, 60, size=n_samples),
    'salary': np.random.randint(30000, 120000, size=n_samples),
    'department': np.random.choice(['IT', 'HR', 'Finance', 'Marketing'], size=n_samples),
    'years_experience': np.round(np.random.normal(5, 2, size=n_samples), 1),  # normal draw, so negative values can occur
    'is_manager': np.random.choice([0, 1], size=n_samples)
}
df = pd.DataFrame(data)

Q1. View data structure

In [22]:
# View the first 5 rows
print("First 5 rows:")
print(df.head())

# View the last 5 rows
print("\nLast 5 rows:")
print(df.tail())

# Get the shape of the DataFrame
print("\nDataFrame shape:", df.shape)

# Get column names
print("\nColumn names:", df.columns.tolist())

First 5 rows:
   age  salary department  years_experience  is_manager
0   56   38392         IT              -0.8           0
1   46   60535  Marketing               3.4           1
2   32  108603         HR               5.0           1
3   25   82256         HR               4.2           1
4   38  119135         HR               4.1           1

Last 5 rows:
    age  salary department  years_experience  is_manager
95   59   82662         IT               4.0           0
96   56   42688         HR               4.4           1
97   58   55342  Marketing               6.0           0
98   45   67157         HR              11.4           0
99   24   97863         HR               5.2           1

DataFrame shape: (100, 5)

Column names: ['age', 'salary', 'department', 'years_experience', 'is_manager']


Q2. Get DataFrame Info and Summary Stats

In [23]:
# Get DataFrame info
print("DataFrame Info:")
df.info()

# Get summary statistics
print("\nSummary Statistics:")
print(df.describe())

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               100 non-null    int64  
 1   salary            100 non-null    int64  
 2   department        100 non-null    object 
 3   years_experience  100 non-null    float64
 4   is_manager        100 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 4.0+ KB

Summary Statistics:
              age         salary  years_experience  is_manager
count  100.000000     100.000000        100.000000  100.000000
mean    37.910000   77809.160000          4.823000    0.470000
std     12.219454   26058.643576          2.237822    0.501614
min     18.000000   30206.000000         -0.800000    0.000000
25%     26.750000   55141.000000          3.475000    0.000000
50%     38.000000   80932.000000          4.700000    0.000000
75%     46.250000   98107.25

Q3. Do Simple NumPy Operations

In [24]:
# Convert columns to numpy arrays
age_array = df['age'].values
salary_array = df['salary'].values

# Basic statistics using NumPy
print(f"Mean age: {np.mean(age_array):.2f}")
print(f"Median age: {np.median(age_array):.2f}")
print(f"Standard deviation of age: {np.std(age_array):.2f}")

print(f"\nMean salary: ${np.mean(salary_array):,.2f}")
print(f"Min salary: ${np.min(salary_array):,}")
print(f"Max salary: ${np.max(salary_array):,}")

print(f"\nTotal payroll: ${np.sum(salary_array):,}")

Mean age: 37.91
Median age: 38.00
Standard deviation of age: 12.16

Mean salary: $77,809.16
Min salary: $30,206
Max salary: $119,474

Total payroll: $7,780,916
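The standard deviation of age printed here (12.16) differs from the `std` row in the `describe()` output (12.219454). Both are correct: `np.std` defaults to the population formula (`ddof=0`), while pandas uses the sample formula (`ddof=1`). The sketch below demonstrates the difference on a small made-up series.

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

population_std = np.std(values.to_numpy())       # divides by n
sample_std = values.std()                        # divides by n - 1 (pandas default)
numpy_sample_std = np.std(values.to_numpy(), ddof=1)

print(f"{population_std:.4f}")    # 2.0000
print(f"{sample_std:.4f}")        # 2.1381
print(f"{numpy_sample_std:.4f}")  # 2.1381
```

Pass `ddof` explicitly whenever NumPy and pandas results need to agree.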


Q4. Filtering and Indexing Rows

In [25]:
# Filter employees older than 40
older_than_40 = df[df['age'] > 40]
print(f"Employees older than 40: {len(older_than_40)} employees")
print(older_than_40.head())

# Filter managers with high salaries
high_paid_managers = df[(df['is_manager'] == 1) & (df['salary'] > 80000)]
print(f"\n\nManagers with salary > $80,000: {len(high_paid_managers)} employees")
print(high_paid_managers.head())

# Get a specific row by integer position
print("\n\nEmployee at index 10:")
print(df.iloc[10])

Employees older than 40: 45 employees
    age  salary department  years_experience  is_manager
0    56   38392         IT              -0.8           0
1    46   60535  Marketing               3.4           1
5    56   65222    Finance               2.8           0
10   41   40965         IT               9.5           1
11   53   54538  Marketing               5.3           0


Managers with salary > $80,000: 25 employees
   age  salary department  years_experience  is_manager
2   32  108603         HR               5.0           1
3   25   82256         HR               4.2           1
4   38  119135         HR               4.1           1
6   36  107373    Finance               6.7           1
8   28  114651  Marketing               5.8           1


Employee at index 10:
age                    41
salary              40965
department             IT
years_experience      9.5
is_manager              1
Name: 10, dtype: object
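Boolean masks are not the only way to filter. The sketch below shows three alternatives that can read more clearly for common cases: `isin` for membership, `between` for inclusive ranges, and `query` for string expressions. It uses a small made-up frame with the same column names as the task dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 45, 52, 31],
    'salary': [50000, 90000, 60000, 85000],
    'department': ['IT', 'HR', 'Finance', 'IT'],
    'is_manager': [0, 1, 0, 1],
})

it_or_hr = df[df['department'].isin(['IT', 'HR'])]    # membership test
mid_career = df[df['age'].between(30, 50)]            # both bounds inclusive
high_paid_managers = df.query('is_manager == 1 and salary > 80000')

print(len(it_or_hr), len(mid_career), len(high_paid_managers))  # 3 2 2
```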


Q5. Adding a Column

In [26]:
# Add a new column: salary per year of experience
# (the +1 keeps the denominator nonzero when years_experience is 0)
df['salary_per_year_exp'] = df['salary'] / (df['years_experience'] + 1)

# Add age group column
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 40, 50, 100],
                          labels=['Young', 'Mid', 'Senior', 'Veteran'])

print("DataFrame with new columns:")
print(df.head(10))

DataFrame with new columns:
   age  salary department  years_experience  is_manager  salary_per_year_exp  \
0   56   38392         IT              -0.8           0        191960.000000   
1   46   60535  Marketing               3.4           1         13757.954545   
2   32  108603         HR               5.0           1         18100.500000   
3   25   82256         HR               4.2           1         15818.461538   
4   38  119135         HR               4.1           1         23359.803922   
5   56   65222    Finance               2.8           0         17163.684211   
6   36  107373    Finance               6.7           1         13944.545455   
7   40  109575  Marketing               6.9           0         13870.253165   
8   28  114651  Marketing               5.8           1         16860.441176   
9   28   93335         IT               1.2           1         42425.000000   

  age_group  
0   Veteran  
1    Senior  
2       Mid  
3     Young  
4       Mid  
5   Vet
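Row 0 in the output above has `years_experience` of -0.8, a side effect of drawing from a normal distribution, which inflates its ratio to 191,960. A value of exactly -1.0 would divide by zero. One possible guard, shown as a sketch rather than as part of the original exercise, is to clip experience at zero first. The second part illustrates that `pd.cut` bins are closed on the right by default, so an age of exactly 30 falls in `Young`.

```python
import pandas as pd

# Two rows copied from the output above
df = pd.DataFrame({'salary': [38392, 60535],
                   'years_experience': [-0.8, 3.4]})

# Treat negative experience as zero before computing the ratio
df['years_experience'] = df['years_experience'].clip(lower=0)
df['salary_per_year_exp'] = df['salary'] / (df['years_experience'] + 1)
print(df)

# Bins are (0, 30], (30, 40], (40, 50], (50, 100] by default
groups = pd.cut([30, 31], bins=[0, 30, 40, 50, 100],
                labels=['Young', 'Mid', 'Senior', 'Veteran'])
print(list(groups))   # ['Young', 'Mid']
```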

Q6. Grouping and Aggregation

In [27]:
# Group by department and get statistics
dept_stats = df.groupby('department').agg({
    'salary': ['mean', 'min', 'max', 'count'],
    'age': 'mean',
    'years_experience': 'mean',
    'is_manager': 'sum'
})

print("Statistics by Department:")
print(dept_stats)

# Average salary by age group
print("\n\nAverage Salary by Age Group:")
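# observed=False keeps the current behavior for categorical keys and avoids the pandas FutureWarning
age_group_salary = df.groupby('age_group', observed=False)['salary'].mean().sort_values(ascending=False)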
print(age_group_salary)

# Count of employees by department
print("\n\nEmployee Count by Department:")
print(df['department'].value_counts())

# Manager vs Non-Manager comparison
print("\n\nAverage Salary: Manager vs Non-Manager")
manager_comparison = df.groupby('is_manager')['salary'].agg(['mean', 'count'])
manager_comparison.index = ['Non-Manager', 'Manager']
print(manager_comparison)

Statistics by Department:
                  salary                             age years_experience  \
                    mean    min     max count       mean             mean   
department                                                                  
Finance     83124.708333  30206  117897    24  39.416667         4.416667   
HR          73523.052632  32693  119135    19  37.157895         5.089474   
IT          75825.476190  35530  117455    21  36.809524         4.390476   
Marketing   77684.722222  30854  119474    36  37.944444         5.205556   

           is_manager  
                  sum  
department             
Finance            10  
HR                 10  
IT                 13  
Marketing          14  


Average Salary by Age Group:
age_group
Mid        88448.720000
Young      81191.900000
Senior     71891.208333
Veteran    67073.904762
Name: salary, dtype: float64


Employee Count by Department:
department
Marketing    36
Finance      24
IT           21
HR       
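The dictionary form of `agg` used above produces two-level column headers, which is why the printed table wraps and is awkward to index. Named aggregation is an alternative that yields flat, self-describing column names. The sketch below uses a small made-up frame; the output column names (`mean_salary`, `headcount`, and so on) are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['IT', 'IT', 'HR'],
    'salary': [50000, 70000, 60000],
    'is_manager': [0, 1, 1],
})

# Each keyword is an output column: name=(source column, aggregation)
dept_stats = df.groupby('department').agg(
    mean_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    headcount=('salary', 'count'),
    managers=('is_manager', 'sum'),
)

print(dept_stats)
print(dept_stats.loc['IT', 'mean_salary'])   # 60000.0
```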

