## Structured Viva Voce Questionnaire for Python and Data Analysis

### Module 1: Python Basic Concepts and Programming

1. **What is the role of the Python interpreter, and how does it support code execution?**  
   **A:** The Python interpreter runs source code without a separate build step. CPython, the reference implementation, first compiles the source to bytecode and then executes that bytecode on its virtual machine. It offers an interactive mode (REPL) for testing snippets and a script mode for running whole programs, which shortens the edit-run-debug cycle.

2. **What are Python identifiers and keywords, and what rules govern their usage?**  
   **A:** Identifiers are user-defined names for variables, functions, or classes. They must start with a letter or underscore, may contain only letters, digits, and underscores, and are case-sensitive. Keywords are reserved words (e.g., `if`, `for`, `while`) with special meanings, which cannot be used as identifiers.

3. **How do expressions differ from statements in Python?**  
   **A:** Expressions are code snippets that produce a value (e.g., `2 + 3`), while statements are complete instructions that perform an action (e.g., the assignment `x = 5`, or an `if` or `for` block). Expressions can appear inside statements, but a statement itself does not evaluate to a value.

4. **What is variable scope, and how is it managed in Python?**  
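   **A:** Variable scope defines where a name is accessible. Python resolves names in LEGB order: local (inside the current function), enclosing function, global (module level), then built-in. Assigning to a name inside a function creates a local variable by default; the `global` keyword rebinds the module-level name instead, and `nonlocal` rebinds a name in an enclosing function.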

5. **What are Python operators, and how does operator precedence affect computations?**  
   **A:** Operators perform operations like arithmetic (`+`, `-`), logical (`and`, `or`), and comparison (`==`). Precedence determines the order of operations (e.g., `*` before `+`), which can be overridden using parentheses for clarity.

6. **Why is indentation critical in Python, and what happens if it’s incorrect?**  
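   **A:** Indentation defines code blocks (e.g., loop and function bodies) in place of braces. Inconsistent indentation raises an `IndentationError` (a subclass of `SyntaxError`) before the program runs, while indentation that is consistent but at the wrong level silently changes which block a statement belongs to.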

7. **How does the `input()` function work, and how can inputs be processed?**  
   **A:** The `input()` function reads user input as a string from the console. It can be converted to other types using functions like `int()` or `float()` for further processing.

8. **What is the purpose of the `type()` function, and how is it used?**  
   **A:** The `type()` function returns the data type of a variable or value (e.g., `int`, `str`, `list`), aiding in debugging and type checking. For example, `type(10)` returns `<class 'int'>`.

9. **What is the `is` operator, and how does it differ from `==`?**  
   **A:** The `is` operator checks if two variables refer to the same object in memory, while `==` checks for value equality. It’s commonly used with `None` or to verify object identity.

10. **How do `if`, `elif`, `else`, and nested `if` statements facilitate decision-making?**  
    **A:** The `if` statement evaluates a condition, executing its block if true. `elif` checks additional conditions if prior ones fail, and `else` runs when no conditions are true. Nested `if` statements allow complex, hierarchical decision-making.

11. **What are `break` and `continue`, and how do they control loop execution?**  
    **A:** `break` exits a loop entirely, while `continue` skips the current iteration and proceeds to the next. Both are used to manage loop flow in `for` and `while` loops.

12. **What are default parameters, `*args`, and `**kwargs` in Python functions?**  
    **A:** Default parameters provide fallback values for arguments (e.g., `def func(x=10)`). `*args` accepts variable positional arguments, and `**kwargs` accepts variable keyword arguments, enabling flexible function definitions.

13. **How are command-line arguments handled in Python?**  
    **A:** The `sys.argv` list in the `sys` module captures command-line arguments, with `sys.argv[0]` being the script name and subsequent elements being passed arguments.

14. **What are the basic data types in Python, and how does type conversion work?**  
    **A:** Basic data types include `int`, `float`, `str`, `bool`, `list`, `tuple`, `dict`, and `set`. Type conversion changes one type to another, e.g., `int("5")` converts a string to an integer.

15. **What are `while` and `for` loops, and when are they used?**  
    **A:** A `while` loop runs as long as a condition is true, suitable for indefinite iterations. A `for` loop iterates over a sequence (e.g., list, string), ideal for known ranges or collections.
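
As a minimal sketch of question 12, the function below combines a default parameter, `*args`, and `**kwargs`. The function name `describe_order` and its arguments are hypothetical and chosen only for illustration.

```python
def describe_order(item, quantity=1, *extras, **options):
    """Build a summary string from flexible arguments."""
    parts = [f"{quantity} x {item}"]          # quantity falls back to 1
    if extras:                                # extra positional arguments arrive as a tuple
        parts.append("extras: " + ", ".join(extras))
    for key, value in sorted(options.items()):  # keyword arguments arrive as a dict
        parts.append(f"{key}={value}")
    return "; ".join(parts)


print(describe_order("coffee"))
# 1 x coffee
print(describe_order("coffee", 2, "milk", "sugar", size="large", to_go=True))
# 2 x coffee; extras: milk, sugar; size=large; to_go=True
```

Sorting the keyword options is a design choice made here so that the output does not depend on the order in which the caller passed them.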

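The scope rules from question 4 can be illustrated with a short sketch. All names here (`counter`, `shadow_counter`, `bump_counter`, `make_accumulator`) are hypothetical; the point is only to contrast a plain assignment with `global` and `nonlocal`.

```python
counter = 0  # module-level (global) name


def shadow_counter():
    counter = 100       # assignment creates a new local name; the global is untouched
    return counter


def bump_counter():
    global counter      # rebind the module-level name instead
    counter += 1
    return counter


def make_accumulator():
    total = 0           # lives in the enclosing scope of add()

    def add(amount):
        nonlocal total  # rebind the enclosing function's variable
        total += amount
        return total

    return add


local_value = shadow_counter()
after_shadow = counter
after_bump = bump_counter()
accumulate = make_accumulator()
running_totals = [accumulate(5), accumulate(10)]
print(local_value, after_shadow, after_bump, running_totals)  # 100 0 1 [5, 15]
```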
---

### Module 2: Python Collection Objects and Object-Oriented Programming

16. **How are strings created, stored, and manipulated in Python?**  
    **A:** Strings are created using single, double, or triple quotes (e.g., `'hello'`, `"hello"`) and stored as immutable character sequences. They support slicing (`string[start:end:step]`) and methods like `upper()`, `lower()`, `strip()`, and `split()` for manipulation, returning new strings due to immutability.

17. **What are lists, and how do they differ from tuples and sets?**  
    **A:** Lists are mutable, ordered collections created with square brackets (e.g., `[1, 2, 3]`), supporting duplicates. Tuples are immutable, ordered sequences created with parentheses (e.g., `(1, 2, 3)`), used for fixed data. Sets are mutable, unordered collections of unique elements created with braces around elements (e.g., `{1, 2, 3}`) or with `set()`; note that empty braces `{}` create a dictionary, so an empty set must be written `set()`. Sets support operations like union and intersection.

18. **What are common list operations and built-in functions?**  
    **A:** Lists support methods like `append()`, `remove()`, `sort()`, and `reverse()`. Built-in functions include `len()` (length), `max()` (maximum), `min()` (minimum), and `sum()` (total), simplifying list processing.

19. **How do dictionaries store and access data?**  
    **A:** Dictionaries store key-value pairs, created with `dict()` or `{}`, e.g., `{"name": "Alice", "age": 25}`. Keys provide fast lookups, and methods like `keys()`, `values()`, and `items()` facilitate data access.

20. **What is indexing and slicing, and how are they applied to lists and strings?**  
    **A:** Indexing accesses a single element by position (e.g., `list[0]`, `string[0]`), while slicing extracts a range (e.g., `list[1:3]`, `string[1:4]`). Both use zero-based indexing, with slicing supporting a step parameter (e.g., `[::2]`).

21. **What is a constructor, and how is it used in Python classes?**  
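    **A:** The `__init__` method acts as the constructor: Python calls it automatically on a newly created object to initialize its attributes. (Strictly speaking, `__new__` creates the object and `__init__` initializes it.) Its first parameter, conventionally named `self`, refers to the instance, e.g., `def __init__(self, x): self.x = x`.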

22. **What is inheritance, and how is it implemented in Python?**  
    **A:** Inheritance allows a class to inherit attributes and methods from a parent class, defined as `class Child(Parent)`. Calling `super().__init__()` inside the child’s `__init__` runs the parent’s initializer so that the parent’s attributes are set up, promoting code reuse.

23. **What is encapsulation, and how is it demonstrated in Python?**  
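    **A:** Encapsulation bundles data with the methods that operate on it and discourages direct access to internal state. Python has no enforced access modifiers; it relies on naming conventions instead. A single leading underscore (`_x`) marks an attribute as protected by convention only, while a double leading underscore (`__x`) triggers name mangling (the attribute is stored as `_ClassName__x`), which makes accidental access from outside the class or from derived classes harder but not impossible.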

24. **What is operator overloading, and how is it implemented?**  
    **A:** Operator overloading defines custom behavior for operators like `+` using special methods (e.g., `__add__`). For example, a class can implement `__add__` to add specific attributes of two objects.

25. **How does method overloading work in Python?**  
    **A:** Python doesn’t support traditional method overloading: defining a second method with the same name simply replaces the first. Similar flexibility comes from default parameters, `*args`, or `**kwargs`, which let a single method handle varying argument counts or types.

26. **How do you read from and write to files in Python, and what are the file modes?**  
    **A:** Use `open("file.txt", mode)` to access files. Mode `"r"` reads content with `read()` or `readlines()`, `"w"` overwrites with `write()`, and `"a"` appends. The `with` statement ensures proper file closure, e.g., `with open("file.txt", "r") as f: content = f.read()`.
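
To illustrate question 24, here is a minimal sketch of operator overloading. The `Vector` class is a hypothetical example, not taken from the course programs; it defines `__add__` so that `+` adds two objects attribute by attribute.

```python
class Vector:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Returning NotImplemented lets Python raise a TypeError for unsupported operands
        if not isinstance(other, Vector):
            return NotImplemented
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        return isinstance(other, Vector) and (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        return f"Vector({self.x}, {self.y})"


total = Vector(1, 2) + Vector(3, 4)
print(total)  # Vector(4, 6)
```

Defining `__eq__` and `__repr__` alongside `__add__` is optional, but it makes the result easy to compare and print.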

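Questions 22 and 23 can be seen together in one sketch. The `Account` and `SavingsAccount` classes below are hypothetical; they show `super().__init__()` in a derived class and how a double-underscore attribute is name-mangled rather than truly hidden.

```python
class Account:
    def __init__(self, owner, balance):
        self.owner = owner
        self._currency = "USD"      # "protected" by convention only
        self.__balance = balance    # stored as _Account__balance (name mangling)

    def balance(self):
        return self.__balance


class SavingsAccount(Account):
    def __init__(self, owner, balance, rate):
        super().__init__(owner, balance)  # run the parent's initializer
        self.rate = rate

    def projected_balance(self):
        # The subclass goes through the public method, not the mangled attribute
        return self.balance() * (1 + self.rate)


acct = SavingsAccount("Alice", 100.0, 0.05)
print(acct.owner, acct.balance())           # Alice 100.0
print(hasattr(acct, "__balance"))           # False: the plain name does not exist
print(acct._Account__balance)               # 100.0: still reachable, so not truly private
```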
---

### Module 3: Data Preprocessing and Data Wrangling

27. **What is data preprocessing and wrangling, and why are they important?**  
    **A:** Data preprocessing and wrangling involve cleaning, transforming, and organizing raw data into a usable format by handling missing values, normalizing data, and merging datasets. They ensure data quality and suitability for analysis, reducing errors and biases.

28. **How do you load and access data from CSV files and SQL databases in Python?**  
    **A:** Use `pd.read_csv("filename.csv")` to load CSV files into a Pandas DataFrame, with options for delimiters or encoding. For SQL databases, use `sqlite3` or `SQLAlchemy` to connect (e.g., `sqlite3.connect("database.db")`), then execute queries to retrieve data, for instance with `pd.read_sql_query(query, connection)` to get the result as a DataFrame.

29. **What is data normalization, and how is it implemented in Python?**  
    **A:** Normalization rescales features to a common scale so that they contribute comparably in analysis. Min-Max scaling maps values to a fixed range (e.g., [0, 1]), while standardization subtracts the mean and divides by the standard deviation, giving zero mean and unit variance. Both can be computed directly with NumPy, or with Scikit-learn’s `MinMaxScaler` (Min-Max scaling) and `StandardScaler` (standardization).

30. **How do you handle missing values in a Pandas DataFrame?**  
    **A:** Use `df.isnull()` to identify missing values and `df.isnull().sum()` to count them per column. Handle them with `df.dropna()` to remove rows with NaN or `df.fillna(value)` to replace NaN, e.g., `df.fillna(df.mean(numeric_only=True))` to fill numeric columns with their column means.

31. **What is data transformation, and how is it performed in Pandas?**  
    **A:** Data transformation modifies data for analysis, such as scaling, encoding categories, or creating new columns. Examples include applying functions with `df.apply(func)`, changing data types with `astype()`, or creating derived columns (e.g., `df["new_col"] = df["old_col"] * 2`).

32. **How do you merge, concatenate, and reshape DataFrames in Pandas?**  
    **A:** Merging combines DataFrames using `pd.merge(df1, df2, on="column")` with join types (inner, outer, etc.). Concatenation stacks DataFrames with `pd.concat([df1, df2])`. Reshaping uses `pivot_table()` for aggregation or `melt()` to unpivot data.

33. **What are regular expressions, and how are they used for data cleaning?**  
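    **A:** Regular expressions (via the `re` module) match and manipulate string patterns, such as extracting numbers (`re.findall(r"\d+", text)`) or removing unwanted characters (`re.sub()`). In Pandas, `df["column"].str.replace(r"\s+", "", regex=True)` removes whitespace; `regex=True` must be passed explicitly because recent Pandas versions treat the pattern as a literal string by default.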

34. **How do you remove duplicates and strip extraneous information in Pandas?**  
    **A:** Use `df.drop_duplicates()` to remove duplicate rows. Strip extraneous information with string methods like `df["column"].str.strip()` for leading and trailing whitespace or `df["column"].str.replace("pattern", "", regex=True)` for characters matching a pattern.

35. **What is a pivot table, and how is it created in Pandas?**  
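    **A:** A pivot table summarizes data by grouping and aggregating based on columns, created with `df.pivot_table(values="value_col", index="row_col", columns="column_col", aggfunc="mean")`. Here `index` and `columns` name the grouping columns, and `values` names the column being aggregated. It’s used to compute averages, sums, or other metrics by category.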
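
The regular-expression cleaning described in question 33 can be sketched with the standard `re` module alone. The helper names (`clean_phone`, `collapse_whitespace`, `extract_numbers`) and the sample strings are hypothetical.

```python
import re


def clean_phone(raw):
    """Remove every non-digit character."""
    return re.sub(r"\D", "", raw)


def collapse_whitespace(text):
    """Replace runs of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def extract_numbers(text):
    """Return all runs of digits as integers."""
    return [int(n) for n in re.findall(r"\d+", text)]


print(clean_phone("(555) 123-4567"))                 # 5551234567
print(collapse_whitespace("  New   York \n City "))  # New York City
print(extract_numbers("3 apples, 12 pears"))         # [3, 12]
```

The same patterns can be passed to Pandas string methods when the data lives in a DataFrame column.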

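For questions 30 and 34, a small sketch on a made-up DataFrame shows counting missing values, filling them with the column mean, and dropping duplicate rows. The column names and values are hypothetical, and the sketch assumes Pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi", "Delhi"],
    "temp": [30.0, np.nan, 25.0, 25.0, 35.0],
})

missing_per_column = df.isnull().sum()            # one NaN in "temp"
filled = df.fillna(df.mean(numeric_only=True))    # NaN -> mean of 30, 25, 25, 35
deduplicated = filled.drop_duplicates()           # the two identical Delhi rows collapse to one

print(missing_per_column["temp"])   # 1
print(filled.loc[1, "temp"])        # 28.75
print(len(deduplicated))            # 4
```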
---

### Module 4: Web Scraping and Numerical Analysis

36. **What is web scraping, and how is it performed in Python?**  
    **A:** Web scraping extracts data from websites using libraries like `BeautifulSoup` or `Scrapy`. It involves fetching HTML with `requests.get("url")` and parsing elements (e.g., tags, classes) using CSS selectors or methods like `soup.select(".class")`.

37. **How do you fetch web pages and submit forms programmatically?**  
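    **A:** Fetch pages with `requests.get("url")`, accessing HTML via `response.text`. Submit forms using `requests.post("url", data={"field": "value"})`, which sends the form fields as a browser would. This works for forms processed on the server; `requests` does not execute JavaScript, so pages that build their content in the browser need a browser automation tool such as Selenium.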

38. **What are CSS selectors, and how are they used in web scraping?**  
    **A:** CSS selectors identify HTML elements by tag, class, or ID (e.g., `.class`, `#id`). In `BeautifulSoup`, `soup.select(".class")` retrieves elements, enabling targeted data extraction.

39. **What is NumPy, and how does it enhance numerical analysis?**  
    **A:** NumPy is a library for numerical computations, providing efficient multi-dimensional arrays and functions for matrix operations, statistical analysis, and linear algebra. It outperforms Python lists with vectorized operations and memory optimization.

40. **How do you create, reshape, and slice NumPy arrays?**  
    **A:** Create arrays with `np.array([1, 2, 3])` or `np.zeros((2, 3))` for a 2×3 zero matrix. Reshape with `array.reshape(2, 3)`, which requires the array to hold exactly 6 elements. Slice using `array[start:end]` for a range, or `array[:, 1]` to select the second column of a 2D array.

41. **What is broadcasting in NumPy, and how does it work?**  
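    **A:** Broadcasting enables element-wise operations on arrays of different shapes without copying data. Shapes are compared from the trailing dimension backward, and two dimensions are compatible when they are equal or one of them is 1; the size-1 dimension is virtually stretched to match. For example, adding a scalar to an array applies the scalar to each element, and adding a shape `(3,)` array to a shape `(2, 3)` array adds it to every row.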
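
A short sketch of the broadcasting behavior in question 41, assuming NumPy is installed. The array values are arbitrary and only serve to make the stretching visible.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])          # shape (2, 3)
row = np.array([10, 20, 30])            # shape (3,)  -> stretched across both rows
column = np.array([[100], [200]])       # shape (2, 1) -> stretched across all columns

plus_scalar = matrix + 1
plus_row = matrix + row
plus_column = matrix + column

print(plus_scalar.tolist())   # [[2, 3, 4], [5, 6, 7]]
print(plus_row.tolist())      # [[11, 22, 33], [14, 25, 36]]
print(plus_column.tolist())   # [[101, 102, 103], [204, 205, 206]]
```

Shapes that cannot be aligned this way, such as `(2, 3)` and `(2,)`, raise a `ValueError` instead of being stretched.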

---

### Module 5: Data Visualization with NumPy, Matplotlib, and Seaborn

42. **What is data visualization, and why is it important?**  
    **A:** Data visualization represents data graphically to identify patterns, trends, and insights. It simplifies complex data interpretation, aiding decision-making and communication.

43. **What is Matplotlib, and how is it used for plotting?**  
    **A:** Matplotlib is a plotting library for creating static, animated, or interactive plots such as line plots (`plt.plot(x, y)`), scatter plots (`plt.scatter(x, y)`), bar charts (`plt.bar(x, y)`), and histograms (`plt.hist(x)`). It allows customization of titles (`plt.title()`), labels (`plt.xlabel()`, `plt.ylabel()`), and figure size (`plt.figure(figsize=(w, h))`).

44. **How does Seaborn enhance Matplotlib, and what are its key features?**  
    **A:** Seaborn, built on Matplotlib, provides high-level, aesthetically pleasing statistical plots like heatmaps (`sns.heatmap()`), regression plots (`sns.regplot()`), and pair plots (`sns.pairplot()`). It simplifies complex visualizations and integrates with Pandas.

45. **How do you plot time series data and add annotations in Matplotlib?**  
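    **A:** Plot time series with `df.plot()` for a Pandas DataFrame indexed by dates. Add annotations using `plt.text(x, y, "text")` for text at given coordinates, or `plt.annotate("text", xy=(x, y), xytext=(x2, y2), arrowprops=dict(arrowstyle="->"))` for text with an arrow pointing at `(x, y)`; without `arrowprops`, `annotate` draws only the text.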

46. **What are scatter plots, histograms, and heatmaps, and what are their uses?**  
    **A:** Scatter plots (`plt.scatter()`) show relationships between two continuous variables. Histograms (`plt.hist()`) display the distribution of a numeric variable. Heatmaps (`sns.heatmap()`) visualize matrix data (e.g., correlations) with color intensity, highlighting patterns.

47. **How do you create advanced visualizations like pair plots and regression plots in Seaborn?**  
    **A:** Use `sns.pairplot(df, hue="category")` to plot pairwise relationships across numeric columns, colored by a categorical variable. Use `sns.regplot(x="col1", y="col2", data=df)` to show a scatter plot with a regression line, indicating variable relationships.
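
The Matplotlib calls from questions 43 and 45 can be combined in one small sketch, assuming Matplotlib is installed. The data points are made up, and the `Agg` backend and in-memory buffer are choices made here so that the script runs without a display or writing a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]          # hypothetical measurements

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, marker="o", label="measurements")
# Text with an arrow pointing at the last data point
ax.annotate("peak", xy=(5, 6), xytext=(3.5, 5.8),
            arrowprops=dict(arrowstyle="->"))
ax.set_title("Hypothetical measurements")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

buffer = io.BytesIO()
fig.savefig(buffer, format="png")   # render the figure to PNG bytes in memory
plt.close(fig)
```

This sketch uses the object-oriented `ax.` interface; the `plt.` functions named in the answers above act on the current axes in the same way.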

---

### Data Analysis with Pandas and NumPy

48. **What are Pandas and NumPy, and how do they support data analysis?**  
    **A:** Pandas provides DataFrames and Series for structured data manipulation, cleaning, and analysis (e.g., loading CSVs, grouping data). NumPy supports numerical computations with arrays, offering vectorized operations and statistical functions. Together, they handle tabular and numerical data efficiently.

49. **What are DataFrames and Series in Pandas, and how are they used?**  
    **A:** A DataFrame is a 2D labeled table, like an Excel sheet, created with `pd.DataFrame()`. A Series is a 1D labeled array, like a column, accessed as `df["column"]`. They support operations like filtering (`df[df["col"] > 10]`), grouping (`df.groupby()`), and merging.

50. **How do you perform data cleaning and manipulation in Pandas?**  
    **A:** Clean data by removing duplicates (`df.drop_duplicates()`), handling missing values (`df.fillna()`, `df.dropna()`), and changing types (`df.astype()`). Manipulate with sorting (`df.sort_values("col")`), renaming (`df.rename(columns={"old": "new"})`), and applying functions (`df.apply(func)`).

51. **What are `loc[]` and `iloc[]`, and how do they differ in Pandas?**  
    **A:** `loc[]` selects data by row and column labels (e.g., `df.loc[0, "col"]`), while `iloc[]` selects by integer position (e.g., `df.iloc[0, 1]`). Both support slicing, but a `loc[]` slice includes its end label, whereas an `iloc[]` slice excludes its end position, as in ordinary Python slicing.

52. **How do you perform statistical analysis in Pandas?**  
    **A:** Use `df.describe()` for summary statistics (count, mean, std, min, quartiles, max), `df.corr(numeric_only=True)` for the correlation matrix of numeric columns, and `df["col"].value_counts()` for counting unique values in a column. Aggregation with `df.groupby("col").agg("mean")` computes group-wise statistics.

53. **How does NumPy improve data analysis compared to Python lists?**  
    **A:** NumPy arrays are faster, memory-efficient, and support vectorized operations, unlike lists, which are slower for numerical tasks. Functions like `np.mean()`, `np.linspace()`, and `np.random.randint()` enable efficient computations and array generation.
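
The difference between `loc[]` and `iloc[]` from question 51 is easiest to see when the index labels are not 0, 1, 2. The DataFrame below is a made-up example, and the sketch assumes Pandas is installed.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy", "Di"], "score": [88, 92, 79, 95]},
    index=[10, 20, 30, 40],            # labels differ from positions on purpose
)

by_label = df.loc[10:30, "score"]      # labels 10, 20, 30 (end label included)
by_position = df.iloc[0:2, 1]          # positions 0 and 1 (end position excluded)
high_scores = df[df["score"] > 90]     # boolean filtering, as in question 49

print(by_label.tolist())               # [88, 92, 79]
print(by_position.tolist())            # [88, 92]
print(high_scores["name"].tolist())    # ['Bob', 'Di']
```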

---

### Machine Learning and Statistical Analysis

54. **What is statistical analysis, and how is it performed in Python?**  
    **A:** Statistical analysis examines data relationships and distributions using metrics like correlation (`df.corr()`) and regression. Python libraries like Pandas, NumPy, and Scikit-learn compute statistics (e.g., `np.mean()`) and model relationships (e.g., `LinearRegression()`).

55. **What is correlation, and how is it visualized?**  
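    **A:** Correlation measures the strength and direction of the linear relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship. It’s computed with `df.corr()` (Pearson by default) and visualized as a heatmap (`sns.heatmap(df.corr(numeric_only=True))`) to show pairwise strengths and directions.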

56. **What are linear and logistic regression, and how are they implemented?**  
    **A:** Linear regression models continuous outcomes using `LinearRegression().fit(X, y)`, returning coefficients (`model.coef_`) and intercept (`model.intercept_`). Logistic regression handles binary classification, implemented with `LogisticRegression().fit(X, y)`, predicting class labels (`model.predict()`).

57. **What is Scikit-learn, and how is it used for machine learning?**  
    **A:** Scikit-learn is a machine learning library for tasks like classification, regression, and clustering. It provides models (e.g., `LinearRegression`, `LogisticRegression`) and utilities like `train_test_split` for splitting data and metrics like R², MAE, and MSE for evaluation.

58. **What are data splitting, feature scaling, and overfitting in machine learning?**  
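    **A:** Data splitting divides data into training and testing sets using `train_test_split`, so the model is evaluated on data it has not seen. Feature scaling puts features on a comparable scale (e.g., with `StandardScaler`); the scaler should be fitted on the training set only and then applied to the test set. Overfitting occurs when a model fits noise and idiosyncrasies of the training data, so it scores well on the training set but performs poorly on new data.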
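
The workflow from questions 56 to 58 can be sketched end to end, assuming Scikit-learn and NumPy are installed. The data is synthetic and noise-free (an assumption made so the result is predictable), which is why the fit is essentially perfect; real data would not behave this way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data following y = 3*x1 - 2*x2 + 5 exactly
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler().fit(X_train)            # fit the scaler on training data only
model = LinearRegression().fit(scaler.transform(X_train), y_train)
predictions = model.predict(scaler.transform(X_test))

r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
print(f"R2={r2:.3f}  MAE={mae:.2e}  MSE={mse:.2e}")
```

Comparing the same metrics on the training and test sets is a simple way to spot overfitting: a large gap suggests the model has memorized the training data.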

---

### Algorithms and Specific Program Analysis

59. **What is linear search, its time complexity, and how does it handle absent elements?**  
    **A:** Linear search checks each element sequentially, with O(n) time complexity in the worst case. When the element is present, the search can stop at the first match (using `break` or a found flag). When it is absent, the loop runs through the entire array and then reports “Element not found”, as seen in Q1.

60. **How does the insertion algorithm in a sorted list work, and what is its time complexity?**  
    **A:** The insertion algorithm (Q2) finds the correct position for a key in a sorted list and inserts it using slicing or concatenation, maintaining order. Its time complexity is O(n) due to linear search and slicing. Binary search could reduce search time to O(log n), but slicing remains O(n).

61. **What are `np.concatenate()`, `np.split()`, and `np.linspace()`, and how are they used?**  
    **A:** `np.concatenate([arr1, arr2])` combines arrays along an axis (Q5). `np.split(arr, n)` divides an array into n equal sub-arrays and raises an error if the array cannot be divided evenly (Q5). `np.linspace(start, stop, num)` generates `num` evenly spaced numbers from `start` to `stop` inclusive, useful for plotting smooth curves (Q5).

62. **What is `np.random.randint()`, and how is it used in matrix visualization?**  
    **A:** `np.random.randint(low, high, size)` generates random integers from `low` (inclusive) to `high` (exclusive), used in Q7 to create an m×n matrix. Visualized with `plt.imshow(matrix, cmap="viridis")`, the color map (`cmap`) enhances interpretability.

63. **What does the `eval()` function do, and what are its risks?**  
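    **A:** The `eval()` function evaluates a string as a Python expression, often used to parse input arrays (e.g., Q2). It poses security risks because the string can contain arbitrary code, so it should not be used on untrusted input. For parsing literals such as lists or numbers, `ast.literal_eval()` is a safer alternative because it accepts only Python literals.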
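
A minimal sketch of the linear search from question 59. It is not the Q1 program itself; the function name `linear_search` and the convention of returning `-1` for an absent element are assumptions made for illustration.

```python
def linear_search(items, target):
    """Return the index of the first match, or -1 if target is absent."""
    for index, value in enumerate(items):
        if value == target:
            return index      # found: stop early
    return -1                 # absent: every element was checked, O(n)


data = [7, 3, 9, 3, 1]
print(linear_search(data, 9))   # 2
print(linear_search(data, 3))   # 1 (first occurrence)
print(linear_search(data, 5))   # -1
```

A caller can turn the `-1` result into the “Element not found” message described above.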

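Question 60 notes that binary search can speed up the position lookup while the insertion itself stays O(n). The sketch below shows that variant using the standard `bisect` module; it is an illustration under that assumption, not the Q2 program, and `insert_sorted` is a hypothetical name.

```python
from bisect import bisect_left


def insert_sorted(sorted_items, key):
    """Return a new sorted list with key inserted; the input list is not modified."""
    position = bisect_left(sorted_items, key)    # binary search: O(log n)
    # Building the new list by slicing and concatenation copies elements: O(n)
    return sorted_items[:position] + [key] + sorted_items[position:]


print(insert_sorted([1, 3, 5, 7], 4))   # [1, 3, 4, 5, 7]
print(insert_sorted([1, 3, 5, 7], 0))   # [0, 1, 3, 5, 7]
print(insert_sorted([], 2))             # [2]
```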
---

### IMDB Dataset Analysis

64. **How is the IMDB dataset cleaned and preprocessed?**  
    **A:** Missing budget values are replaced with NaN, duplicates are removed with `df.drop_duplicates()`, and budget/revenue are adjusted to 2010 dollars for inflation. Multi-value columns like genres are split, often using the first genre for simplicity.

65. **What insights does the IMDB analysis reveal about budget, runtime, and popularity?**  
    **A:** High-budget films (above median) have 50% higher average popularity than low-budget ones, shown via scatter plots with weak linear correlation. Films with 100–200 minute runtimes are more popular, declining beyond 200 minutes, visualized with bar and scatter plots.

66. **How are genres and trends analyzed in the IMDB dataset?**  
    **A:** The first genre is extracted from pipe-separated values, and a custom function counts occurrences. Heatmaps (`sns.heatmap()`) show budget and revenue trends, revealing action/adventure’s rise post-2000.

67. **What is the role of `df.nlargest()` and profit calculation in the IMDB analysis?**  
    **A:** `df.nlargest(n, "column")` identifies top n rows (e.g., top 10 revenue films). Profit is calculated as revenue minus budget (adjusted values), analyzing its correlation with popularity.

68. **What are the limitations of the IMDB dataset analysis?**  
    **A:** Missing budget/revenue data and lack of currency standardization for international films skew results. IMDb’s proprietary popularity metric lacks transparency, limiting interpretability.
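
The profit calculation and `nlargest()` call from question 67 can be sketched on a tiny made-up table, assuming Pandas is installed. The film titles, the numbers, and the column names `budget_adj` and `revenue_adj` are hypothetical stand-ins for the adjusted columns; they are not values from the actual dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
films = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget_adj": [100.0, 50.0, 200.0, 10.0],
    "revenue_adj": [300.0, 40.0, 500.0, 90.0],
    "popularity": [2.5, 0.8, 4.1, 1.2],
})

films["profit"] = films["revenue_adj"] - films["budget_adj"]   # revenue minus budget
top_two = films.nlargest(2, "revenue_adj")                     # highest revenue first
profit_popularity_corr = films["profit"].corr(films["popularity"])

print(films["profit"].tolist())      # [200.0, -10.0, 300.0, 80.0]
print(top_two["title"].tolist())     # ['C', 'A']
```

On the real data, rows with missing budget or revenue would need to be handled before the subtraction, since NaN propagates into the profit column.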

---

### General Data Analysis Concepts

69. **What is exploratory data analysis (EDA), and how is it conducted?**  
    **A:** EDA summarizes data characteristics through statistics (`df.describe()`) and visualizations (e.g., scatter plots, histograms). It identifies patterns, outliers, and relationships before modeling.

70. **What is outlier detection, and why is it important?**  
    **A:** Outlier detection identifies values deviating significantly from the dataset, using methods like Z-scores or IQR. It’s crucial to prevent skewed analysis and improve model accuracy.

71. **How do you evaluate a regression model’s performance?**  
    **A:** Use metrics like R² (proportion of variance explained), Mean Absolute Error (MAE), and Mean Squared Error (MSE), available in Scikit-learn, to assess how well the model predicts outcomes.
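
To make the metrics in question 71 concrete, the sketch below computes MAE, MSE, and R² by hand in plain Python. The function name `regression_metrics` and the sample values are hypothetical; in practice Scikit-learn's metric functions would normally be used instead.

```python
def regression_metrics(actual, predicted):
    """Compute MAE, MSE, and R² for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e ** 2 for e in errors) / n          # mean squared error
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                      # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)      # total sum of squares
    r2 = 1 - ss_res / ss_tot                       # proportion of variance explained
    return {"mae": mae, "mse": mse, "r2": r2}


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)  # {'mae': 0.25, 'mse': 0.125, 'r2': 0.975}
```

Lower MAE and MSE indicate smaller errors, while R² closer to 1 indicates that the model explains more of the variance in the target.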

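The IQR method mentioned in question 70 can be sketched as follows, assuming NumPy is installed. The function name `iqr_outliers`, the multiplier `k=1.5` (the common rule of thumb), and the sample values are assumptions made for illustration.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[(data < lower) | (data > upper)]


outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
print(outliers.tolist())   # [95.0]
```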