# Data Toolkit Assignment

# Theory Questions

## 1. What is NumPy, and why is it widely used in Python?

NumPy is a Python library for numerical computing. It provides a powerful N-dimensional array object, tools for working with those arrays, and routines for fast mathematical operations. It is widely used because it is fast (written in C under the hood), supports vectorized operations (which are concise and efficient), and is the foundation for many other scientific libraries like pandas, SciPy, and scikit-learn.

## 2. How does broadcasting work in NumPy?

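Broadcasting is the set of rules NumPy follows to perform element-wise operations between arrays of different shapes. NumPy compares the shapes starting from the trailing dimension; two dimensions are compatible when they are equal or when one of them is 1. A dimension of size 1 (or a missing leading dimension) is virtually 'stretched' to match the other array, without copying data. If the shapes are not compatible, NumPy raises a ValueError. This lets you write concise code without explicit loops, for example adding a 1D array to each row of a 2D array.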
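
As a minimal sketch of the row-wise addition mentioned above (the array values are illustrative and not part of the original assignment), a 1D array can be added to every row of a 2D array, and a column vector to every column:

```python
import numpy as np

# A 2D array with shape (2, 3) and a 1D array with shape (3,)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# Shapes (2, 3) and (3,) are compatible, so `row` is applied to each row
added_rows = matrix + row
print(added_rows)

# A column vector with shape (2, 1) is stretched across the columns instead
col = np.array([[100], [200]])
added_cols = matrix + col
print(added_cols)

# Shapes (2, 3) and (2,) do not line up, so NumPy raises a ValueError
try:
    matrix + np.array([1, 2])
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print("Broadcast error:", exc)
```

No data is copied when `row` or `col` is stretched; NumPy only adjusts how it steps through memory.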

## 3. What is a Pandas DataFrame?

A DataFrame is a 2-dimensional labeled data structure in pandas, similar to a table or spreadsheet. It holds data in rows and columns with labels, supports different data types, and provides many methods for data cleaning, transformation, aggregation, and analysis.

## 4. Explain the use of the groupby() method in Pandas.

groupby() is used to split data into groups based on some criteria (one or more columns), apply a function (like sum, mean, count) to each group, and combine the results. It is very useful for aggregated summaries and pivot-like operations.
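
The split-apply-combine pattern can be sketched as follows. The `sales` table and its column names are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical sales records (illustrative data only)
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "amount": [100, 150, 200, 50, 75],
})

# Split by region, sum the amounts in each group, combine into one Series
totals = sales.groupby("region")["amount"].sum()
print(totals)

# Several aggregations at once yield one column per statistic
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```

The result is indexed by the group keys, which is why it reads like a small pivot-style summary.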

## 5. Why is Seaborn preferred for statistical visualizations?

Seaborn is built on top of Matplotlib and provides high-level functions for statistical plotting. It makes it easy to create attractive plots with built-in themes, handles data frames directly, and offers specialized plots for exploring relationships (like pairplot, violinplot, boxplot) which are helpful for statistical analysis.

## 6. What are the differences between NumPy arrays and Python lists?

NumPy arrays are homogeneous (all elements have the same data type), fixed-size, and stored in contiguous memory which enables fast numerical operations. Python lists are heterogeneous, can grow/shrink dynamically, and have higher memory overhead. For numerical work, NumPy arrays are typically much faster.
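
A short sketch of how these differences show up in practice (the values are illustrative): the same `*` operator repeats a list but multiplies an array element by element, and an array coerces its elements to one common type.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

# `*` repeats a list but multiplies an array element-wise
doubled_list = py_list * 2
doubled_arr = np_arr * 2
print(doubled_list)  # [1, 2, 3, 1, 2, 3]
print(doubled_arr)   # [2 4 6]

# A list can hold mixed types; an array upcasts to a single dtype
mixed_list = [1, "two", 3.0]
mixed_arr = np.array([1, 2.5, 3])
print(mixed_arr.dtype)  # float64
```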

## 7. What is a heatmap, and when should it be used?

A heatmap is a 2D representation of data where values are shown as colors. It is useful for visualizing matrices like correlation matrices, confusion matrices, or any table of values where color intensity can reveal patterns and trends at a glance.

## 8. What does the term 'vectorized operation' mean in NumPy?

Vectorized operations are operations applied element-wise to arrays without explicit Python loops. NumPy performs these using optimized C code, which makes them much faster than equivalent Python loops.
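
For comparison, here is the same calculation written as an explicit Python loop and as a vectorized expression. The temperature conversion is just an illustrative choice:

```python
import numpy as np

celsius = np.array([0.0, 25.0, 100.0])

# Explicit Python loop: one interpreted iteration per element
fahrenheit_loop = [c * 9 / 5 + 32 for c in celsius.tolist()]

# Vectorized: a single expression applied to the whole array in compiled code
fahrenheit_vec = celsius * 9 / 5 + 32

print(fahrenheit_loop)
print(fahrenheit_vec)
```

Both produce the same values; the vectorized form avoids the per-element interpreter overhead, which matters more as the array grows.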

## 9. How does Matplotlib differ from Plotly?

Matplotlib is a widely used plotting library for Python that primarily produces static figures (mostly 2D, with 3D support through its mplot3d toolkit). Plotly focuses on interactive, web-based plots (zoom, hover, pan) and is often used for dashboards and interactive exploration. Matplotlib is usually the simpler choice for static reports; Plotly is the better fit when interactivity is needed.

## 10. What is the significance of hierarchical indexing in Pandas?

Hierarchical (multi-level) indexing allows multiple index levels on rows and/or columns. It helps represent higher-dimensional data in a 2D table and makes complex groupby/pivot operations and reshaping tasks easier.
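
A minimal sketch of a two-level index, using hypothetical year and quarter labels and made-up revenue figures:

```python
import pandas as pd

# Two index levels: year and quarter (illustrative labels and values)
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name="revenue")

# Selecting on the outer level returns every quarter of that year
year_2024 = revenue.loc["2024"]
print(year_2024)

# A full key selects a single value
q2_2023 = revenue.loc[("2023", "Q2")]
print(q2_2023)

# unstack() pivots the inner level into columns, giving a 2D table
wide = revenue.unstack("quarter")
print(wide)
```

Here the Series stores what is conceptually a year-by-quarter grid in one dimension, and `unstack()` recovers the grid on demand.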

## 11. What is the role of Seaborn’s pairplot() function?

pairplot() draws a grid of pairwise scatterplots for the numeric variables in a DataFrame, with each variable's own distribution (a histogram by default) on the diagonal. It helps you quickly explore relationships between many pairs of variables and inspect their distributions.

## 12. What is the purpose of the describe() function in Pandas?

describe() returns summary statistics for numeric columns: count, mean, std, min, quartiles, and max. With include='all' it also summarizes non-numeric columns (count, unique, top, freq). It is a quick first step toward understanding how the data are distributed.
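
As a small sketch, here is describe() on the same two-column DataFrame used in Practical Question 2:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1.5, 2.5, 3.0, 4.0]})

# One row per statistic, one column per numeric column
stats = df.describe()
print(stats)

# Individual statistics can be looked up by label
print("Mean of A:", stats.loc["mean", "A"])
print("Median of A:", stats.loc["50%", "A"])
```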

## 13. Why is handling missing data important in Pandas?

Missing data can bias results, break calculations, or produce misleading visualizations. Handling missing values (via drop, fill, interpolation) ensures analyses are robust and correct.
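
The three strategies named above can be sketched on a small column with gaps. The `readings` data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with two missing values
readings = pd.DataFrame({"temp": [20.0, np.nan, 24.0, np.nan, 28.0]})

# Count the gaps before deciding how to handle them
n_missing = int(readings["temp"].isna().sum())
print("Missing values:", n_missing)

# Option 1: drop the rows that contain missing values
dropped = readings.dropna()

# Option 2: fill the gaps with a constant, here the column mean
filled = readings["temp"].fillna(readings["temp"].mean())

# Option 3: interpolate linearly between neighboring values
interpolated = readings["temp"].interpolate()

print(dropped)
print(filled.tolist())
print(interpolated.tolist())
```

Which option is appropriate depends on why the values are missing; dropping is the simplest, but it discards the other columns in those rows too.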

## 14. What are the benefits of using Plotly for data visualization?

Plotly produces interactive plots that are easy to share in notebooks and web apps. It supports hover information, zooming, and embedding in dashboards. This interactivity helps with both data exploration and presentation.

## 15. How does NumPy handle multidimensional arrays?

NumPy uses the ndarray object which supports any number of dimensions. Operations support axes, broadcasting rules, and efficient reshaping and indexing to manipulate multidimensional data.
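
A brief sketch of axes, reshaping, and indexing on a 3D array (the shape is an arbitrary choice for illustration):

```python
import numpy as np

# Reshape 24 consecutive integers into 2 blocks of 3 rows by 4 columns
cube = np.arange(24).reshape(2, 3, 4)
print(cube.ndim, cube.shape)  # 3 (2, 3, 4)

# Reducing along an axis removes that axis from the result
sum_over_blocks = cube.sum(axis=0)   # shape (3, 4)
sum_over_columns = cube.sum(axis=2)  # shape (2, 3)

# Index with one entry per axis: block 1, every row, column 0
first_col_block1 = cube[1, :, 0]
print(first_col_block1)  # [12 16 20]
```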

## 16. What is the role of Bokeh in data visualization?

Bokeh is a Python library for interactive visualizations targeting web browsers. It creates interactive plots that can handle streaming data and large datasets and is often used for dashboards.

## 17. Explain the difference between apply() and map() in Pandas.

map() performs element-wise transformations on a Series, passing each value through a dict, a Series, or a function. apply() is more general: on a Series it applies a function to each value, and on a DataFrame it applies a function to each column (axis=0, the default) or each row (axis=1). apply() is more flexible, but it is often slower than an equivalent vectorized operation.
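
The difference can be sketched with a small hypothetical table of grades and scores:

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    "grade": ["A", "B", "C"],
    "math": [90, 70, 50],
    "art": [80, 60, 40],
})

# map(): element-wise lookup on a single Series via a dict
points = df["grade"].map({"A": 4, "B": 3, "C": 2})

# apply() on a DataFrame, column by column (axis=0 is the default)
col_range = df[["math", "art"]].apply(lambda col: col.max() - col.min())

# apply() on a DataFrame, row by row (axis=1)
row_total = df[["math", "art"]].apply(lambda row: row["math"] + row["art"], axis=1)

print(points.tolist())
print(col_range.to_dict())
print(row_total.tolist())
```

For a row total like this one, `df["math"] + df["art"]` gives the same result and is the faster, vectorized form; apply() is shown only to illustrate the axis argument.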

## 18. What are some advanced features of NumPy?

Advanced features include broadcasting, strides/view memory model, advanced indexing, linear algebra routines, FFT, random number generation, and masked arrays. These enable efficient scientific computations.

## 19. How does Pandas simplify time series analysis?

Pandas has dedicated datetime types, resampling, rolling/window functions, shifting, and frequency-aware indexing that make time series operations (like resample to monthly, compute rolling averages) straightforward.
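
A minimal sketch of these operations on six days of made-up values. It resamples to two-day bins rather than months so the example stays small:

```python
import pandas as pd

# Six consecutive daily observations (illustrative values)
dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Rolling mean over a 3-day window; the first two entries are NaN
rolling = ts.rolling(window=3).mean()

# Shift values forward by one period, e.g. to compare with the previous day
shifted = ts.shift(1)

# Resample to 2-day bins and sum the values within each bin
two_day = ts.resample("2D").sum()

# Frequency-aware indexing: slice by date strings
first_three = ts["2024-01-01":"2024-01-03"]

print(rolling)
print(two_day)
```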

## 20. What is the role of a pivot table in Pandas?

pivot_table() reshapes data to a summary table with aggregations across specified index and columns. It's useful for summarizing data like in spreadsheet pivot tables.
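
As a sketch, assume a hypothetical `orders` table with one row per order:

```python
import pandas as pd

# Illustrative order data
orders = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["pen", "book", "pen", "book", "pen"],
    "amount": [10, 20, 30, 40, 50],
})

# Rows are regions, columns are products, cells hold the summed amount
table = pd.pivot_table(
    orders,
    index="region",
    columns="product",
    values="amount",
    aggfunc="sum",
)
print(table)
```

The two North pen orders (10 and 50) collapse into a single cell, which is the aggregation step that distinguishes pivot_table() from a plain reshape.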

## 21. Why is NumPy’s array slicing faster than Python’s list slicing?

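Basic NumPy slicing returns a view onto the original array's memory buffer, so no elements are copied, and later operations on that view run in optimized C loops. Python list slicing creates a new list and copies a reference to every selected element, so its cost grows with the length of the slice. Because a NumPy slice is a view, writing to it also modifies the original array.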
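
The view behavior of NumPy slices, as opposed to the copy made by list slicing, can be checked directly. This is a small sketch with illustrative values:

```python
import numpy as np

arr = np.arange(10)
view = arr[2:5]          # basic slice: shares memory with `arr`
view[0] = 99             # writing through the view changes `arr` too
shares = np.shares_memory(arr, view)
print(arr[2], shares)    # 99 True

lst = list(range(10))
part = lst[2:5]          # list slice: a new list with copied references
part[0] = 99             # the original list is unaffected
print(lst[2])            # 2
```

When an independent array is needed, `arr[2:5].copy()` makes the copy explicit.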

## 22. What are some common use cases for Seaborn?

Seaborn is commonly used for exploratory data analysis: distribution plots, categorical comparisons (box/violin), relationship plots (scatter/regression), matrix plots (heatmap), and pairwise relationships (pairplot).

# Practical Questions

## 1: Create a 2D NumPy array and calculate the sum of each row

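```python
import numpy as np

# Build a 2x3 array; axis=1 sums across the columns, giving one total per row
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Array:")
print(arr)
print("Row sums:", arr.sum(axis=1))
```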

**Output:**

```
Array:
[[1 2 3]
 [4 5 6]]
Row sums: [ 6 15]
```

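## 2: Find the mean of column 'A' in a DataFrame

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1.5, 2.5, 3.0, 4.0]})
print("DataFrame:")
print(df.to_string(index=False))

# Select column 'A' and take its arithmetic mean
mean_A = df['A'].mean()
print("Mean of column A:", mean_A)
```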

**Output:**

```
DataFrame:
 A    B
 10  1.5
 20  2.5
 30  3.0
 40  4.0
Mean of column A: 25.0
```

## 3: Create a scatter plot using Matplotlib

```python
import numpy as np
import matplotlib.pyplot as plt

# Random x values and a noisy linear response y = 2x + noise
x = np.random.randn(50)
y = 2 * x + np.random.randn(50) * 0.5

plt.scatter(x, y)
plt.title('Scatter plot (sample)')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```

**Output:** Scatter plot image saved and shown below.

![](/mnt/data/scatter_plot.png)

## 4: Calculate correlation matrix and visualize as heatmap

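```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sample_df = pd.DataFrame({'age':[23,45,12,36,52,24,33],'income':[500,1200,200,800,1500,600,700],'score':[60,80,40,70,90,65,68]})
corr = sample_df.corr()  # pairwise Pearson correlations
print(corr)
# Heatmap step (absent from the original cell); assumes seaborn is installed
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```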

**Output:** Correlation matrix and heatmap image.

```
             age    income    score
age     1.000000  0.982643  0.97175
income  0.982643  1.000000  0.96662
score   0.971750  0.966620  1.00000
```

![](/mnt/data/corr_heatmap.png)

## 5: Generate a bar plot

```python
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C']
values = [23, 45, 12]

# One bar per category, with height given by the matching value
plt.bar(categories, values)
plt.title('Bar plot (sample)')
plt.show()
```

**Output:** Bar plot image.

![](/mnt/data/bar_plot.png)

## 6: Create a DataFrame and add a new column based on an existing column

```python
import pandas as pd

df2 = pd.DataFrame({'marks': [45, 67, 78, 89]})

# Derive a letter grade from each mark: A for >= 75, B for >= 60, otherwise C
df2['grade'] = df2['marks'].apply(lambda m: 'A' if m >= 75 else ('B' if m >= 60 else 'C'))
df2
```

**Output:**

```
 marks grade
   45     C
   67     B
   78     A
   89     A
```

## 7: Element-wise multiplication of two NumPy arrays

```python
import numpy as np

a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])

# `*` multiplies matching elements; it is not a dot product
a1 * a2
```

**Output:**

```
[ 4 10 18]
```

## 8: Create a line plot with multiple lines

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Label each line so the legend can tell them apart
plt.plot(x, y1, label='sin(x)')
plt.plot(x, y2, label='cos(x)')
plt.title('Line plot - sin and cos')
plt.legend()
plt.show()
```

**Output:** Line plot saved.

![](/mnt/data/line_plot.png)

## 9: Filter rows where a column value is greater than a threshold

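```python
import pandas as pd

df3 = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'score': [55, 75, 45, 85]})
print(df3.to_string(index=False))

# The boolean mask keeps only the rows whose score exceeds the threshold of 60
filtered = df3[df3['score'] > 60]
print(filtered.to_string(index=False))
```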

**Output:**

Original:

```
 name  score
    a     55
    b     75
    c     45
    d     85
```

Filtered (score>60):

```
 name  score
    b     75
    d     85
```

## 10: Create a histogram to visualize a distribution

```python
import numpy as np
import matplotlib.pyplot as plt

# 200 samples from a normal distribution with mean 50 and standard deviation 10
data_hist = np.random.normal(loc=50, scale=10, size=200)

plt.hist(data_hist, bins=15)
plt.title('Histogram (sample)')
plt.show()
```

**Output:** Histogram image.

![](/mnt/data/hist_plot.png)

## 11: Perform matrix multiplication using NumPy

```python
import numpy as np

m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])

# Matrix product; for 2D arrays `m1 @ m2` is equivalent to m1.dot(m2)
m1 @ m2
```

**Output:**

```
[[19 22]
 [43 50]]
```

## 12: Use Pandas to load a CSV file and display its first 5 rows

```python
import pandas as pd

# Write a small sample file first so there is a CSV to load
sample_csv = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'value': [10, 20, 30, 40, 50]})
sample_csv.to_csv('/mnt/data/sample_data.csv', index=False)

# Load the file back and show the first 5 rows (head() defaults to n=5)
loaded = pd.read_csv('/mnt/data/sample_data.csv')
loaded.head()
```

**Output:**

```
 id  value
  1     10
  2     20
  3     30
  4     40
  5     50
```

## 13: Create a 3D scatter plot using Matplotlib

```python
# This import registers the '3d' projection on older Matplotlib versions
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# 40 random points drawn from a standard normal distribution on each axis
xs = np.random.randn(40)
ys = np.random.randn(40)
zs = np.random.randn(40)

ax.scatter(xs, ys, zs)
ax.set_title('3D scatter (matplotlib)')
plt.show()
```

**Output:** 3D scatter image.

![](/mnt/data/scatter3d.png)