#Data Toolkit Assignment

#Q1. What is NumPy, and why is it widely used in Python?

  - NumPy is a Python library for fast numerical computing. It provides:

     Multi-dimensional arrays (ndarray) that are faster than Python lists.

     Built-in math functions (e.g., stats, linear algebra).

     Vectorized operations (no slow loops).

     Memory efficiency (fixed data types).

     Integration with Pandas, SciPy, ML libraries.

  - Why use it?

     Speed: Optimized C/Fortran backend.

     Convenience: Clean syntax for math operations.

     Foundation: Used by almost every data science/AI library.

#Example:

     import numpy as np

     a = np.array([1, 2, 3])

     b = np.array([4, 5, 6])

     print(a + b)  # Output: [5 7 9] (fast element-wise addition)

#Output>>

      [5 7 9]


#Q2. How does broadcasting work in NumPy?

  - Broadcasting allows NumPy to perform operations on arrays of different
   shapes by automatically expanding the smaller array to match the larger one.

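   Rules:

  - Shapes are compared dimension by dimension, starting from the trailing (rightmost) dimension and moving left.

  - Two dimensions are compatible when they are equal or when one of them is 1; a missing leading dimension is treated as 1.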


   #Example:

      a = np.array([[1], [2], [3]])  # Shape (3,1)  

      b = np.array([10, 20, 30])     # Shape (3,)

      print(a + b)  

#Output>>>

    [[11 21 31]
     [12 22 32]
     [13 23 33]]
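
The compatibility rules can be checked without doing any arithmetic. The following is a minimal sketch, assuming NumPy 1.20 or later (where `np.broadcast_shapes` is available); the variable names `column` and `row` are illustrative only.

```python
import numpy as np

# Shapes from the example above: (3, 1) and (3,)
column = np.array([[1], [2], [3]])
row = np.array([10, 20, 30])

# broadcast_shapes reports the result shape without doing the arithmetic
result_shape = np.broadcast_shapes(column.shape, row.shape)
print(result_shape)  # (3, 3)

# Incompatible trailing dimensions (2 vs 3) raise a ValueError
try:
    np.broadcast_shapes((3, 2), (3,))
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False
```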


#Q3. What is a Pandas DataFrame?

  - A Pandas DataFrame is a 2D table-like structure with rows and columns (like Excel).


#Example:

   import pandas as pd

   data = {"Name": ["Arif", "Kamran"], "Age": [26, 27]}  

   df = pd.DataFrame(data)  

   print(df)  

#Output>>>   

          Name  Age
     0    Arif   26
     1  Kamran   27

#Q4. Explain the use of the groupby() method in Pandas.

 - groupby() splits the rows into groups by the values of one or more columns, applies a function (like sum() or mean()) to each group, and combines the results into one table.


#Example:

   data = {"Fruit": ["Apple", "Banana", "Apple"], "Price": [10, 20, 15]}  
   df = pd.DataFrame(data)  
   grouped = df.groupby("Fruit").mean()  
   print(grouped)  


#Output>>>

             Price
     Fruit
     Apple    12.5
     Banana   20.0
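
Several aggregations can be computed per group in one call. This is a small sketch on the same fruit data; the choice of `sum`, `mean`, and `count` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Apple"], "Price": [10, 20, 15]})

# Apply several aggregations to the Price column of each Fruit group
summary = df.groupby("Fruit")["Price"].agg(["sum", "mean", "count"])
print(summary)
```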

#Q5. Why is Seaborn preferred for statistical visualizations?


   Seaborn is built on Matplotlib and provides:

   - Easier syntax for complex plots.

   - Built-in themes for better aesthetics.

   - Statistical functions (e.g., regression plots, distribution plots).


#Example:

import seaborn as sns  
tips = sns.load_dataset("tips")  
sns.boxplot(x="day", y="total_bill", data=tips)  



#Q6. What are the differences between NumPy arrays and Python lists?


     Feature         NumPy Arrays               Python Lists

     Speed           Faster (C backend)         Slower

     Memory Usage    More efficient             Less efficient

     Operations      Supports vectorized ops    Needs loops

     Homogeneous     All elements same type     Can mix types

      

#Example:

import numpy as np  
arr = np.array([1, 2, 3])  
lst = [1, 2, 3]  
print(arr * 2)  # [2 4 6]  
print([x * 2 for x in lst])  # [2, 4, 6]  



#Output>>

    [2 4 6]

    [2, 4, 6]
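
The "same type" and "memory" rows of the table can be made concrete. A minimal sketch, assuming an explicit `int64` dtype so the byte count is the same on every platform:

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)
print(ints.dtype, ints.nbytes)  # int64 24  (3 elements x 8 bytes)

# Mixing an int and a float upcasts the whole array to one common dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# The * operator means different things for the two containers
print(ints * 2)       # [2 4 6]
print([1, 2, 3] * 2)  # [1, 2, 3, 1, 2, 3]
```

Note that multiplying a list by 2 repeats it rather than doubling each element, which is why the list version in the example above needs a comprehension.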


#Q7. What is a heatmap, and when should it be used?

  - A heatmap visualizes matrix-like data using colors.

  When to use:

  Correlation matrices.

  Confusion matrices.

  Any 2D data with intensity variations.

#Example:
  
   - A heatmap can show the correlation between different stock prices, with color intensity encoding the strength of each relationship (which colors mean "strong" depends on the chosen colormap).


#Output>>

   A grid-like chart with colors.


#Q8. What does the term “vectorized operation” mean in NumPy?

  - Vectorized operations in NumPy allow performing operations on entire arrays without explicit loops, making computations faster and more efficient.

#Example:

import numpy as np

arr = np.array([1, 2, 3, 4])

result = arr * 2

print(result)

#Output>>

  [2 4 6 8]
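
To show that a vectorized expression computes the same thing as an explicit loop, here is a small sketch using a Celsius-to-Fahrenheit conversion (the data and names are made up for illustration):

```python
import numpy as np

celsius = np.array([0.0, 25.0, 100.0])

# Explicit loop: one Python-level operation per element
looped = np.array([c * 9 / 5 + 32 for c in celsius])

# Vectorized: the same formula applied to the whole array at once
vectorized = celsius * 9 / 5 + 32

print(vectorized)                          # [ 32.  77. 212.]
print(np.array_equal(looped, vectorized))  # True
```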

#Q9. How does Matplotlib differ from Plotly?

  - Matplotlib mainly produces static figures and gives fine-grained control over every plot element, while Plotly creates interactive visualizations with hover effects and zoom capabilities out of the box.

  #Example:

import matplotlib.pyplot as plt
import plotly.express as px

# Matplotlib

plt.plot([1, 2, 3, 4])

plt.title("Matplotlib Line Plot")

plt.show()

# Plotly

fig = px.line(x=[1, 2, 3, 4], y=[1, 2, 3, 4], title="Plotly Line Plot")

fig.show()

#Output:>>>
  
   Matplotlib will show a static plot. Plotly will show an interactive plot in the notebook or browser.

#Q10. What is the significance of hierarchical indexing in Pandas?

 - Hierarchical indexing (MultiIndex) allows Pandas to have multiple index levels, enabling the representation of higher-dimensional data in a Series or DataFrame.


#Example:

import pandas as pd

index = pd.MultiIndex.from_tuples([('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)],
names=['group', 'id'])

data = pd.Series([10, 20, 30, 40], index=index)

print(data)

 #Output:

       group  id
       G1     1     10
              2     20
       G2     1     30
              2     40
       dtype: int64
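
Once a MultiIndex exists, data can be selected by level or reshaped into a 2D table. A minimal sketch on the same Series, using `loc`, `xs`, and `unstack`:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("G1", 1), ("G1", 2), ("G2", 1), ("G2", 2)], names=["group", "id"]
)
data = pd.Series([10, 20, 30, 40], index=index)

# Select every row of the outer level 'G1'
g1 = data.loc["G1"]

# Cross-section: id == 1 across all groups
first_ids = data.xs(1, level="id")

# Move the inner level into columns to get a 2D table
table = data.unstack("id")
print(table)
```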


#Q11. What is the role of Seaborn's pairplot() function?

 - pairplot() creates a matrix of scatter plots showing the relationships between pairs of variables in a DataFrame, along with histograms on the diagonal.

 #Example:

import seaborn as sns

import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

sns.pairplot(data)

#Output>>>

 - A grid of plots showing scatter plots for each pair of columns and  
   histograms for individual columns.


#Q12. What is the purpose of the describe() function in Pandas?

 - describe() provides summary statistics like mean, standard deviation, min, max, etc., for numeric columns.

#Example:

import pandas as pd

df = pd.DataFrame({'Age': [21, 25, 30, 35, 40]})

print(df.describe())

#Output:>>>

                  Age
     count   5.000000
     mean   30.200000
     std     7.596052
     min    21.000000
     25%    25.000000
     50%    30.000000
     75%    35.000000
     max    40.000000
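
It can help to know how the rows are computed. The sketch below checks two of them on the same ages: the `std` row is the sample standard deviation (`ddof=1`), and the `50%` row is the median.

```python
import numpy as np
import pandas as pd

ages = pd.Series([21, 25, 30, 35, 40])
summary = ages.describe()

# describe() reports the sample standard deviation (ddof=1)
sample_std = np.std(ages.to_numpy(), ddof=1)
print(np.isclose(summary["std"], sample_std))  # True

# The 50% row is the median
print(summary["50%"] == ages.median())  # True
```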

#Q13. Why is handling missing data important in Pandas?

 - Missing data can lead to biased analyses and incorrect conclusions. Pandas provides tools to clean, fill, or remove missing values to ensure data quality.

 #Example:

import pandas as pd

import numpy as np

data = pd.Series([1, 2, np.nan, 4])
print(data.fillna(0)) # Fill missing with 0


 #Output:>>>

   0    1.0
   1    2.0
   2    0.0
   3    4.0
   dtype: float64
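
Filling with 0 is only one option. A short sketch of the other common steps on the same Series: counting missing values, dropping them, and filling with the mean of the observed values.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2, np.nan, 4])

missing_count = data.isna().sum()        # how many values are missing
dropped = data.dropna()                  # remove missing entries
mean_filled = data.fillna(data.mean())   # fill with the mean of observed values

print(missing_count)  # 1
print(len(dropped))   # 3
print(mean_filled)
```

Which strategy is appropriate depends on the data; filling with a constant or a mean changes the distribution, while dropping rows reduces the sample size.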

#Q14. What are the benefits of using Plotly for data visualization?

  - Interactive plots (hover, zoom, tooltips)

  - Beautiful and easy-to-share visualizations

  - Works well in dashboards and web apps

#Example:

import plotly.express as px

fig = px.bar(x=['A', 'B', 'C'], y=[10, 20, 30])

fig.show()



#Q15. How does NumPy handle multidimensional arrays?

  - NumPy's core object is the ndarray, which can represent arrays of any number of dimensions. It provides efficient storage and operations for these arrays.


#Example:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array

print(arr)

#Output:>>>
   
   [[1 2 3]
    [4 5 6]]
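
The dimensions of an ndarray are exposed through a few attributes and the `axis` argument. A minimal sketch on the same 2D array (the name `cube` is illustrative):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)         # 2
print(arr.shape)        # (2, 3)
print(arr.sum(axis=0))  # [5 7 9]   (sum down each column)
print(arr.sum(axis=1))  # [ 6 15]   (sum across each row)

# Reshape the same six values into a 3D array
cube = arr.reshape(1, 2, 3)
print(cube.shape)       # (1, 2, 3)
print(arr[1, 2])        # 6
```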


#Q16. What is the role of Bokeh in data visualization?

  -  Bokeh is a Python library for creating interactive web-based plots. It focuses on interactivity and producing elegant graphics.

  - Role: Creating interactive plots for modern web browsers.

#Example:
  
from bokeh.plotting import figure, show

# Create a simple interactive line plot

p = figure(title="Bokeh Line Plot", x_axis_label='X', y_axis_label='Y')

p.line([1, 2, 3, 4], [4, 7, 1, 6], line_width=2)

show(p)

 #Output:>>>

   An interactive line chart opens in your browser where you can:

   - Zoom

   - Pan

   - Hover over points (if tooltips are enabled)


#Q17. Explain the difference between apply() and map() in Pandas.

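 - map() works element by element on a Series; it accepts a function, a dict, or another Series and substitutes each value with the corresponding result.

 - apply() calls a function along an axis of a DataFrame (once per column or once per row) or on each value of a Series.

#Example:

import pandas as pd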

s = pd.Series([1, 2, 3])

print(s.map({1: 'A', 2: 'B', 3: 'C'}))  # map

print(s.apply(lambda x: x * 2))  # apply


#Output:>>>

     0    A
     1    B
     2    C
   dtype: object

     0    2
     1    4
     2    6
   dtype: int64
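
Two behaviours are easy to miss: map() yields NaN for values missing from a dict, and apply() on a DataFrame passes whole columns or rows to the function. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Keys missing from the mapping become NaN
partial = s.map({1: "A"})
print(partial.isna().tolist())  # [False, True, True]

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_total = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_range.tolist())  # [2, 20]
print(row_total.tolist())  # [11, 22, 33]
```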


#Q18. What are some advanced features of NumPy?

 - NumPy offers powerful advanced features such as:

 - Broadcasting: Perform operations on arrays of different shapes.

 - Vectorization: Apply operations to entire arrays without loops.

 - Linear Algebra: Matrix multiplication, inversion, eigenvalues, etc.

 - FFT (Fast Fourier Transform): For signal processing.

 - Masked Arrays: Handling invalid or missing data.

 - Structured Arrays: Arrays with named fields (like a table).

#Example (Linear Algebra - Matrix Inversion):

import numpy as np

a = np.array([[1, 2], [3, 4]])

inv = np.linalg.inv(a)

print(inv)

  
#Output:>>>

  [[-2.   1. ]
   [ 1.5 -0.5]]
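
Two of the less familiar features from the list, masked arrays and structured arrays, can be sketched briefly. The sentinel value -999.0 and the names "Ann" and "Bob" are hypothetical sample data.

```python
import numpy as np

# Masked array: treat the sentinel -999.0 as missing
readings = np.array([1.0, -999.0, 3.0])
masked = np.ma.masked_equal(readings, -999.0)
print(masked.mean())  # 2.0

# Structured array: named fields, each with its own dtype
people = np.array(
    [("Ann", 30), ("Bob", 25)],
    dtype=[("name", "U10"), ("age", "i4")],
)
print(people["age"].mean())  # 27.5
print(people[0]["name"])     # Ann
```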

#Q19. How does Pandas simplify time series analysis?

 - Pandas makes time series analysis easy by providing:

 - Datetime indexing with automatic parsing

 - Resampling to change frequency (e.g., daily to monthly)

 - Rolling windows for moving averages

 - Shifting for lag/lead analysis

 - Built-in support for date-based filtering and plotting

#Example:

import pandas as pd

# Create a time series

dates = pd.date_range('2024-01-01', periods=3, freq='D')

df = pd.DataFrame({'sales': [100, 150, 200]}, index=dates)

# Resample to hourly data and forward-fill missing values

# Note: recent pandas releases prefer the lowercase alias 'h' and warn on 'H'
resampled = df.resample('H').ffill()

print(resampled.head(5))


#Output:

                       sales

   2024-01-01 00:00:00    100

   2024-01-01 01:00:00    100

   2024-01-01 02:00:00    100

   2024-01-01 03:00:00    100

   2024-01-01 04:00:00    100
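
Rolling windows, shifting, and date-based filtering from the list above can be sketched on a short daily series. The sales figures are made-up sample values.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5, freq="D")
sales = pd.Series([100, 150, 200, 250, 300], index=dates)

# 3-day moving average; the first two entries are NaN (window not full)
moving_avg = sales.rolling(window=3).mean()

# Day-over-day change using a one-step lag
daily_change = sales - sales.shift(1)

# Date-based filtering with string labels (both endpoints included)
first_three = sales.loc["2024-01-01":"2024-01-03"]

print(moving_avg.iloc[2])    # 150.0
print(daily_change.iloc[1])  # 50.0
print(len(first_three))      # 3
```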



#Q20. What is the role of a pivot table in Pandas?

 - A pivot table in Pandas is used to summarize and group data. It allows you to rearrange data based on columns and apply aggregation functions like sum, mean, count, etc.

 #Example:

import pandas as pd

# Create a sample DataFrame

df = pd.DataFrame({
    'City': ['A', 'A', 'B', 'B'],
    'Year': [2022, 2023, 2022, 2023],
    'Sales': [100, 150, 200, 250]
})

# Create pivot table

pivot = df.pivot_table(values='Sales', index='City', columns='Year', aggfunc='sum')

print(pivot)


#Output:>>>

      Year  2022  2023
      City
      A      100   150
      B      200   250
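
The aggregation function matters once several rows fall into the same cell. A minimal sketch with a duplicated (City, Year) pair and an empty cell; the data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["A", "A", "A", "B"],
    "Year": [2022, 2022, 2023, 2023],
    "Sales": [100, 50, 150, 250],
})

# Two rows share (A, 2022), so aggfunc decides how they combine;
# fill_value replaces the empty (B, 2022) cell
pivot = df.pivot_table(
    values="Sales", index="City", columns="Year", aggfunc="sum", fill_value=0
)
print(pivot)
```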


#Q21. Why is NumPy’s array slicing faster than Python’s list slicing?

  NumPy’s array slicing is faster because:

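 - Basic slicing of a NumPy array returns a view onto the existing memory buffer instead of copying elements, so its cost does not grow with the length of the slice.

 - Slicing a Python list builds a new list and copies a reference for every element in the slice.

 - NumPy arrays store fixed-type elements in a single (typically contiguous) block of memory, and slicing is implemented in C.

In contrast, Python lists are more flexible but less memory-efficient and slower for numerical tasks.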

 #example:
  
import numpy as np

import time

# NumPy array slicing

arr = np.arange(1000000)

start = time.time()

slice_np = arr[100:200]

end = time.time()

print("NumPy slicing time:", end - start)

# Python list slicing

lst = list(range(1000000))

start = time.time()

slice_list = lst[100:200]

end = time.time()

print("List slicing time:", end - start)

 #Output (illustrative; exact timings vary by machine and run):

   NumPy slicing time: 1.2e-06

   List slicing time: 6.5e-06
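
The no-copy behaviour mentioned above can be observed directly, without timing anything. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[2:5]   # a view: no elements are copied
arr_slice[0] = 99      # writes through to the original array
print(arr[2])                            # 99
print(np.shares_memory(arr, arr_slice))  # True

lst = list(range(10))
lst_slice = lst[2:5]   # a new list holding copied references
lst_slice[0] = 99      # the original list is unaffected
print(lst[2])                            # 2
```

Because the NumPy slice shares memory with the original, call `.copy()` on it when an independent array is needed.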
  
#Q22. What are some common use cases for Seaborn?

 - Seaborn is a powerful data visualization library built on top of Matplotlib. It’s commonly used for:

 - Statistical visualizations (box plots, violin plots, bar plots)

 - Visualizing distributions (histograms, KDE plots)

 - Exploring relationships (scatter plots, regression plots)

 - Correlation heatmaps

 - Pairwise relationships using pairplot()

 #Example:

import seaborn as sns

import matplotlib.pyplot as plt

# Load built-in dataset
tips = sns.load_dataset('tips')

# Box plot
sns.boxplot(x='day', y='total_bill', data=tips)

plt.title("Box plot of total bill per day")

plt.show()

# Correlation heatmap
corr = tips.corr(numeric_only=True)

sns.heatmap(corr, annot=True, cmap='coolwarm')

plt.title("Correlation Heatmap")

plt.show()

#Output:>>>

 - First plot: A box plot comparing total bills across days.

 - Second plot: A heatmap showing correlations between numerical columns.


#Practical Questions

#Q1. How do you create a 2D NumPy array and calculate the sum of each row?

'''
import numpy as np

# Create 2D array
arr = np.array([[1, 2, 3], [4, 5, 6]])

# Sum of each row
row_sum = arr.sum(axis=1)

print("2D Array:\n", arr)
print("Row-wise sum:", row_sum)


#Output:>>>

2D Array:
 [[1 2 3]
 [4 5 6]]
Row-wise sum: [ 6 15]

'''

#Q2. Write a Pandas script to find the mean of a specific column in a DataFrame.

'''
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'Marks': [70, 80, 90, 85]})

# Mean of 'Marks'
mean_marks = df['Marks'].mean()

print("Mean Marks:", mean_marks)


#Output:>>>

Mean Marks: 81.25

'''


#Q3. Create a scatter plot using Matplotlib.

'''
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.scatter(x, y)
plt.title("Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()


#Output:>>>

 A scatter plot with 4 points plotted on a 2D grid.

'''

#Q4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

'''

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
df = pd.DataFrame({
    'Math': [90, 80, 70, 85],
    'Science': [88, 76, 95, 90],
    'English': [70, 85, 78, 82]
})

# Correlation matrix
corr = df.corr()

# Heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix Heatmap")
plt.show()


#Output:>>>

 A heatmap showing correlation coefficients between subjects.

'''


#Q5. Generate a bar plot using Plotly.

'''
import plotly.express as px

# Sample data
data = {'Category': ['A', 'B', 'C', 'D'], 'Values': [10, 15, 7, 12]}

# Create a bar plot
fig = px.bar(data, x='Category', y='Values', title='Bar Plot Example')

# Show the plot
fig.show()


#Output:>>>

An interactive bar plot will be displayed in the browser.

'''

#Q6. Create a DataFrame and add a new column based on an existing column.

'''
import pandas as pd

# Create a DataFrame
data = {'Product': ['X', 'Y', 'Z'], 'Price': [10, 20, 15]}
df = pd.DataFrame(data)

# Add a new column 'Discounted_Price' (e.g., 10% off)
df['Discounted_Price'] = df['Price'] * 0.9

print("DataFrame with new column:\n", df)

#Output:>>>

DataFrame with new column:
    Product  Price  Discounted_Price
0        X     10               9.0
1        Y     20              18.0
2        Z     15              13.5

'''

#Q7. Write a program to perform element-wise multiplication of two NumPy arrays.

'''
import numpy as np

# Create two NumPy arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Perform element-wise multiplication
result = arr1 * arr2

print("Array 1:", arr1)
print("Array 2:", arr2)
print("Element-wise multiplication:", result)


#Output:>>>

Array 1: [1 2 3]
Array 2: [4 5 6]
Element-wise multiplication: [ 4 10 18]

'''

#Q8. Create a line plot with multiple lines using Matplotlib.

'''

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4]
y1 = [1, 4, 9, 16]
y2 = [1, 2, 3, 4]

# Create the line plot
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')

# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Multiple Line Plot')

# Add legend
plt.legend()

# Show the plot
plt.show()

#Output:>>>

A line plot with two lines will be displayed.

'''

#Q9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

'''

import pandas as pd

# Create a Pandas DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 92, 78, 88]}
df = pd.DataFrame(data)

# Filter rows where 'Score' is greater than 85
filtered_df = df[df['Score'] > 85]

print("Original DataFrame:\n", df)
print("Filtered DataFrame (Score > 85):\n", filtered_df)


#Output:>>>

Original DataFrame:
       Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78
3    David     88
Filtered DataFrame (Score > 85):
     Name  Score
1    Bob     92
3  David     88

'''

#Q10. Create a histogram using Seaborn to visualize a distribution.

'''
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = [55, 60, 65, 70, 70, 75, 80, 85, 90, 95]

# Create histogram
sns.histplot(data, bins=5, kde=True)
plt.title("Histogram with KDE")
plt.xlabel("Scores")
plt.ylabel("Frequency")
plt.show()


#Output:>>>

A histogram showing the frequency of values, with a KDE curve overlaid (kde=True).
'''

#Q11. Perform matrix multiplication using NumPy.

'''
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

result = np.dot(A, B)
print("Matrix multiplication result:\n", result)


#Output:>>>

Matrix multiplication result:
 [[19 22]
 [43 50]]

'''
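
For 2D arrays, the `@` operator gives the same matrix product as `np.dot`, while `*` is element-wise. A brief sketch on the same two matrices (the names `matmul` and `elementwise` are illustrative):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

matmul = A @ B        # matrix product, same as np.matmul(A, B)
elementwise = A * B   # element-wise product, a different operation

print(np.array_equal(matmul, np.dot(A, B)))  # True
print(elementwise)
```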

#Q12. Use Pandas to load a CSV file and display its first 5 rows.


'''
import pandas as pd

# Create a small sample 'data.csv' in the current directory so the example is self-contained
data = {'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]}
df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)


# Load the CSV file
df = pd.read_csv('data.csv')

# Display the first 5 rows
print(df.head())

#Output:>>>

   col1  col2
0     1     7
1     2     8
2     3     9
3     4    10
4     5    11

'''

#Q13. Create a 3D scatter plot using Plotly.

'''
import plotly.express as px
import numpy as np

# Sample data
data = {'x': np.random.rand(100), 'y': np.random.rand(100), 'z': np.random.rand(100)}

# Create the 3D scatter plot
fig = px.scatter_3d(data, x='x', y='y', z='z', title='3D Scatter Plot')

# Show the plot
fig.show()

#Output:>>>

An interactive 3D scatter plot will be displayed in the browser.

'''