## Common Data Science Interview Questions

#### **QUESTION 1**

**You are given a table T that contains two columns:**
- company
- country

**You need to count the number of companies in each country using:**
1. Python
2. SQL


In [1]:
import pandas as pd
companies = ['Google', 'Meta', 'Yandex', 'IBM']
countries = ['USA', 'USA', 'Russia', 'USA']
table = pd.DataFrame({'company': companies, 'country': countries})
table

|   | company | country |
|---|---------|---------|
| 0 | Google  | USA     |
| 1 | Meta    | USA     |
| 2 | Yandex  | Russia  |
| 3 | IBM     | USA     |


**First solution (with Python)**

In [9]:
# Group rows by country and count the non-null company names in each group
table.groupby('country').agg({'company': 'count'}).reset_index()

|   | country | company |
|---|---------|---------|
| 0 | Russia  | 1       |
| 1 | USA     | 3       |
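A related point that often comes up as a follow-up: `count` only counts non-null values in the selected column, while `size` counts every row in the group. The sketch below rebuilds the same table and computes both; the output column name `company_count` is an assumption, chosen to match the alias in the SQL solution.

```python
import pandas as pd

table = pd.DataFrame({
    'company': ['Google', 'Meta', 'Yandex', 'IBM'],
    'country': ['USA', 'USA', 'Russia', 'USA'],
})

# count() skips missing company names (comparable to COUNT(company) in SQL)
by_count = table.groupby('country')['company'].count().reset_index(name='company_count')

# size() counts every row in the group (comparable to COUNT(*) in SQL)
by_size = table.groupby('country').size().reset_index(name='company_count')

print(by_count)
print(by_size)
```

With no missing values in `company`, both approaches give the same counts; they diverge only when some company names are null.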


**Second solution (with SQL)**

<code>SELECT country, COUNT(company) as company_count<br>
FROM T<br>
GROUP BY country;</code>
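To check the query without a database server, it can be run against an in-memory SQLite database from Python. This is a minimal sketch: the table definition with two `TEXT` columns is an assumption, and `ORDER BY country` is added only to make the row order deterministic.

```python
import sqlite3

# In-memory database that holds the example table T
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE T (company TEXT, country TEXT)')
conn.executemany(
    'INSERT INTO T (company, country) VALUES (?, ?)',
    [('Google', 'USA'), ('Meta', 'USA'), ('Yandex', 'Russia'), ('IBM', 'USA')],
)

query = """
SELECT country, COUNT(company) AS company_count
FROM T
GROUP BY country
ORDER BY country;
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)  # [('Russia', 1), ('USA', 3)]
```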

#### **QUESTION 2**

#### What metrics do you know for a classification task?

1. **Accuracy**: the proportion of correctly classified instances out of the total instances.

$$
\frac{TP + TN}{TP + TN + FP + FN}
$$

2. **Precision**: the number of true positive predictions divided by the total number of positive predictions.

$$
\frac{TP}{TP + FP}
$$

3. **Recall**: the number of true positive predictions divided by the total number of actual positive instances.

$$
\frac{TP}{TP + FN}
$$

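4. **F1 Score**: the harmonic mean of precision and recall, providing a balance between the two metrics.

$$
\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}
$$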

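5. **ROC-AUC**: the area under the ROC curve, which plots the true positive rate (TPR, y-axis) against the false positive rate (FPR, x-axis) across all classification thresholds. A value of 1.0 means perfect ranking of positives above negatives; 0.5 corresponds to random guessing.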
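The threshold-based metrics above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name `classification_metrics` and the example counts are hypothetical, accuracy is taken as the share of correct predictions (TP + TN over all instances), and there are no guards against division by zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute basic classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total   # correct predictions / all predictions
    precision = tp / (tp + fp)     # correct positives / predicted positives
    recall = tp / (tp + fn)        # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# Hypothetical counts, chosen only for illustration
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```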

#### How to graph a ROC?

**Obtain Predictions and True Labels:**<br>
Get the predicted probabilities or scores from your classification model; each score represents how likely the instance is to belong to the positive class. You also need the true labels for the dataset, indicating the actual class membership.

**Sort Data Points:**<br>
Sort your data points based on the predicted probabilities in descending order. This means the instances with the highest predicted probabilities are at the beginning of the list.

**Initialize ROC Curve:**<br>
Start with a ROC curve that begins at the point (0,0).

**Iterate Through Data Points:**<br>
For each data point, check its true label. If the true label is positive, move the curve upward (increasing the True Positive Rate) by a step equal to one divided by the number of actual positive instances. If the true label is negative, move the curve to the right (increasing the False Positive Rate) by a step equal to one divided by the number of actual negative instances. After the last data point, the curve reaches (1,1).

**Plot the Points:**<br>
Plot each point as you move through the sorted data points. This creates the ROC curve.

**Connect the Dots:**<br>
Once you have plotted all points, connect them to form the ROC curve.
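The steps above can be sketched in plain Python. This is a simplified illustration, not a reference implementation: the helper names `roc_points` and `auc` and the example scores are hypothetical, labels are assumed to be 0/1, and tied scores are not given any special handling.

```python
def roc_points(y_true, y_score):
    """Build ROC points by walking through instances sorted by score (descending)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)

    fpr, tpr = 0.0, 0.0
    points = [(fpr, tpr)]          # the curve starts at (0, 0)
    for i in order:
        if y_true[i] == 1:
            tpr += 1 / n_pos       # positive label: step up
        else:
            fpr += 1 / n_neg       # negative label: step right
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores and labels, chosen only for illustration
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]

points = roc_points(y_true, y_score)
roc_auc = auc(points)
print(points)
print(round(roc_auc, 4))  # 0.8889
```

The list of `(FPR, TPR)` pairs can be passed to any plotting library as x and y coordinates to draw the curve.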

#### **QUESTION 3**

#### What can you say about the depth of trees in Random Forest and Boosting?

**Random Forest:**
- Trees are typically grown deep (often to full depth, without pruning)

**Boosting:**
- Trees are typically limited in depth (often 3-5)

#### Why?

**Bias–variance tradeoff** <br>

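**RF:**
- The base model needs to have a low bias, which can be achieved with deeper trees.
- Variance is reduced by the algorithm, which averages many decorrelated trees trained on bootstrap samples. <br>

**Boosting:**
- The base model needs to have a low variance, which can be achieved with simple (not deep) trees.
- Bias is reduced by the algorithm, which adds trees sequentially so that each one corrects the errors of the ensemble built so far.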
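The depth difference can be observed with scikit-learn, which is an assumption here since the original does not name a library. In this sketch the random forest is left unconstrained (`max_depth=None`) and gradient boosting is limited to `max_depth=3`; the synthetic dataset and the number of trees are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Random forest: trees are grown until the leaves are pure
rf = RandomForestClassifier(n_estimators=50, max_depth=None, random_state=0).fit(X, y)

# Gradient boosting: each tree is a shallow, weak learner
gb = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)

rf_depths = [tree.get_depth() for tree in rf.estimators_]
gb_depths = [tree.get_depth() for tree in gb.estimators_.ravel()]

print('Random forest mean tree depth:', sum(rf_depths) / len(rf_depths))
print('Boosting max tree depth:', max(gb_depths))
```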


#### **QUESTION 4**

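#### Explain the application of the <code>.rolling()</code> method and demonstrate how to use it on the provided data.

The <code>.rolling()</code> method creates a sliding window over a Series or DataFrame, so that an aggregation such as a mean, sum, or maximum is computed over the last N rows at each position. A typical use is smoothing a time series with a moving average.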

In [9]:
import pandas as pd

# Generate a date range for seven days starting from '2023-01-01'
date_range = pd.date_range('2023-01-01', periods=7, freq='D')

# Create a DataFrame with the generated date range and some example values
data = {'day': date_range, 'value': [10, 15, 20, 25, 30, 35, 40]}
df = pd.DataFrame(data)

# Display the DataFrame
df

|   | day        | value |
|---|------------|-------|
| 0 | 2023-01-01 | 10    |
| 1 | 2023-01-02 | 15    |
| 2 | 2023-01-03 | 20    |
| 3 | 2023-01-04 | 25    |
| 4 | 2023-01-05 | 30    |
| 5 | 2023-01-06 | 35    |
| 6 | 2023-01-07 | 40    |


In [10]:
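# 3-row rolling window: each value is the mean of the current row and the two rows before it.
# The first two rows are NaN because their windows hold fewer than 3 values.
df['moving_average'] = df['value'].rolling(window=3).mean()
df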

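|   | day        | value | moving_average |
|---|------------|-------|----------------|
| 0 | 2023-01-01 | 10    | NaN            |
| 1 | 2023-01-02 | 15    | NaN            |
| 2 | 2023-01-03 | 20    | 15.0           |
| 3 | 2023-01-04 | 25    | 20.0           |
| 4 | 2023-01-05 | 30    | 25.0           |
| 5 | 2023-01-06 | 35    | 30.0           |
| 6 | 2023-01-07 | 40    | 35.0           |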
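The first two rows have no moving average because a window of 3 needs three observations by default. As a possible follow-up, the sketch below shows the `min_periods` parameter, which lets partial windows produce a value, and a second aggregation on the same window. The column names `moving_average_partial` and `moving_sum` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'day': pd.date_range('2023-01-01', periods=7, freq='D'),
    'value': [10, 15, 20, 25, 30, 35, 40],
})

# Default behaviour: NaN until the window holds 3 values
df['moving_average'] = df['value'].rolling(window=3).mean()

# min_periods=1: average whatever is available in the window so far
df['moving_average_partial'] = df['value'].rolling(window=3, min_periods=1).mean()

# Any aggregation can be applied to the same window, for example a rolling sum
df['moving_sum'] = df['value'].rolling(window=3).sum()

print(df)
```

With `min_periods=1`, the first two rows become 10.0 and 12.5 instead of NaN, while the remaining rows match the original moving average.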
