# Data Analyst/Scientist Intern Assessment


# Please write your full legal name, working phone number, and email address on this line

Welcome to the KPI.com Data Analyst/Scientist Intern assessment. This notebook is designed to evaluate your proficiency and skills relevant to the role. Below are the key expectations for this assessment:

## Candidate Expectations:

- **Proficiency in Key Libraries**: You are expected to be proficient in Python libraries such as `pandas`, `numpy`, `scikit-learn`, and `matplotlib`.
- **Code Quality**: Your code should be neat and concise. Strive for clarity and efficiency in your solutions.
- **Use of Functions**: Demonstrate your ability to modularize code by using functions where necessary.

## Application of Skills:

Your data analysis skills will be pivotal for several cutting-edge projects at KPI.com. These include:

- **Building Language-Based Models**: Utilizing models like GPT-3.5/4 and LLaMA, together with frameworks like LangChain, for various applications.
- **Development of AI-Powered Dashboards**: Creating interactive and informative dashboards to visualize and interpret data.
- **Speech Assistants**: Contributing to the development of AI-driven speech assistants.

This assessment is an opportunity for you to showcase your abilities in a practical context, relevant to the dynamic and innovative projects at KPI.com.


# Task 1: Basic data manipulation

## Objective:
This task is designed to assess your ability to import, manipulate, and analyze a dataset that contains both textual and quantitative data. You will be working with the 'Wine Reviews' dataset, which provides a more realistic and challenging scenario than basic datasets like Iris.

## Instructions:

1. **Data Importation**:
    - Import the Wine Reviews dataset. This dataset can be found on Kaggle and can be imported using Pandas:
      ```python
      import pandas as pd
      wine_reviews = pd.read_csv('path_to_dataset/winemag-data-130k-v2.csv')
      ```
      The file should be about 52.91 MB; use this size to identify the correct file if you find similar datasets.
    - Note 1: You may need to download the dataset from [Kaggle](https://www.kaggle.com/zynicide/wine-reviews) and adjust the path accordingly.
    - Note 2: For our convenience, store the file path in a separate variable so that we can simply replace it when checking your work.
2. **Basic Data Manipulation**:
    - **Explore the Data**: Display the first 5 rows of the dataset to understand its structure.
    - **Data Cleaning**: Check for and handle any missing values in the dataset.
    - **Data Transformation**:
        - Select a subset of columns that includes both text (like review descriptions) and quantitative data (like points or price).
        - Perform a basic transformation, such as normalizing numerical columns of your choice, and explain why this transformation is helpful.

3. **Brief Analysis**:
    - Provide a short analysis of the dataset based on the manipulations performed. Discuss any interesting patterns or insights you find, particularly focusing on how the textual and quantitative data relate to each other.
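
The steps above can be sketched on a tiny made-up table. This is only an illustration under stated assumptions, not a reference solution: the `DATA_PATH` variable name, the toy rows, the cleaning policy, and the `min_max_normalize` helper are all hypothetical, and the real dataset may call for different choices.

```python
import pandas as pd

# Hypothetical path variable: a reviewer only needs to change this one line.
DATA_PATH = "path_to_dataset/winemag-data-130k-v2.csv"


def min_max_normalize(series: pd.Series) -> pd.Series:
    """Rescale a numeric series to the [0, 1] range."""
    span = series.max() - series.min()
    return (series - series.min()) / span


# Toy stand-in so the sketch runs without a download (values are invented).
# With the real file: wine_reviews = pd.read_csv(DATA_PATH)
toy = pd.DataFrame(
    {
        "description": ["Ripe and fruity", "Tart, crisp", None, "Oaky finish"],
        "points": [87, 90, 85, 93],
        "price": [15.0, None, 20.0, 65.0],
    }
)

print(toy.head())        # first rows (up to 5) to inspect the structure
print(toy.isna().sum())  # number of missing values per column

# One possible policy (an assumption): drop rows without review text,
# then fill missing prices with the median of the remaining prices.
clean = toy.dropna(subset=["description"]).copy()
clean["price"] = clean["price"].fillna(clean["price"].median())

# Put price on a 0-1 scale so it is comparable with other scaled columns.
clean["price_scaled"] = min_max_normalize(clean["price"])
print(clean[["description", "points", "price", "price_scaled"]])
```

Min-max scaling is only one option; if a column is constant, the span is zero and the helper would need a guard.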

## Deliverable:
- A Jupyter notebook containing your code and all executed output for the analyses, along with comments and interpretations of the results.

## Evaluation Criteria:
- Correctness of code.
- Ability to perform basic data manipulation tasks, especially handling both textual and numerical data.
- Clarity and organization of code and comments.

The 'Wine Reviews' dataset offers a rich context for data manipulation and analysis, making it an excellent choice for showcasing your ability to handle diverse data types. This task will demonstrate your foundational skills in data handling, crucial for the advanced data analysis and model building tasks in your role.


In [None]:
# Add cells below and write your code here

# Task 2: Statistical Analysis with the Wine Reviews Dataset

## Objective:
This task aims to evaluate your statistical analysis skills. Using the 'Wine Reviews' dataset, you will perform various statistical analyses to derive insights and understand patterns in the data.

## Instructions:

1. **Statistical Summaries**:
    - Compute summary statistics (mean, median, standard deviation, etc.) for the numerical columns in the dataset (like points or price).
    - Create a brief summary explaining these statistics in the context of the dataset.

2. **Correlation Analysis**:
    - Analyze the correlation between different numerical variables (e.g., price and points).
    - Visualize these correlations using a heatmap or scatter plots.
    - Provide insights on any strong correlations or lack thereof and what they might imply about the dataset.
    - Explain how you would evaluate the impact of the other variables on the price of the wine.

3. **Optional Advanced Analysis: Hypothesis Testing**:
    - Formulate a hypothesis related to the dataset. For example, "Wines from a particular country have higher average ratings than the global average."
    - Perform appropriate statistical tests to validate or refute your hypothesis.
    - Discuss the results and their implications in the context of wine reviews.
    - If you are comfortable, perform a more advanced statistical analysis of your choice. This could involve regression models, clustering, etc.
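
As a rough illustration of the first two instructions, the sketch below computes summary statistics and two correlation measures on invented numbers. The toy values are assumptions, and a rank-based correlation is shown next to Pearson only as one possible way to handle a skewed variable such as price.

```python
import pandas as pd

# Invented values standing in for the real 'points' and 'price' columns.
toy = pd.DataFrame(
    {
        "points": [85, 87, 90, 93, 95],
        "price": [12.0, 18.0, 30.0, 55.0, 120.0],
    }
)

# count, mean, std, min, quartiles, and max for each numeric column.
summary = toy.describe()
medians = toy.median()
print(summary)
print(medians)

# Pearson measures linear association between the two columns.
pearson = toy["points"].corr(toy["price"])

# Rank-based (Spearman-style) correlation: Pearson computed on the ranks,
# which is less sensitive to a few very expensive wines.
rank_corr = toy["points"].rank().corr(toy["price"].rank())

print(f"Pearson: {pearson:.3f}, rank-based: {rank_corr:.3f}")
```

A gap between the two measures, as in this toy data, suggests that the relationship is monotonic but not linear.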

## Deliverable:
- A Jupyter notebook containing your code and all executed output for the analyses, along with comments and interpretations of the results.

## Evaluation Criteria:
- Correct application of statistical methods.
- Depth and relevance of insights derived from the analysis.
- Clarity and organization of code, comments, and interpretations.

In this task, you will apply your statistical knowledge to real-world data, demonstrating your ability to extract meaningful insights from complex datasets. This skill is crucial for the advanced data analysis and predictive modeling tasks in your role.


In [None]:
# Add cells below and write your code here

# Task 3: Statistical Intuition Questions

## Objective:
This task is designed to assess your intuitive understanding of key statistical concepts fundamental to data analysis.

## Questions:

1. **Standardization of Variables**:
   - *Why is it important to standardize variables before performing certain statistical analyses or using specific machine learning algorithms?*

2. **Encoding of Categorical Variables in Regression**:
   - *In regression models, why are categorical variables often encoded as 0 and 1 (dummy or one-hot encoding), instead of using numerical labels like 1, 2, 3, etc.?*

3. **Reliance on Statistical Tables for t-test or F-test**:
   - *Why do we rely on tables (or software functions) to determine critical values or p-values for tests like the t-test or F-test?*

4. **Importance of Data Visualization**:
   - *Why is data visualization an important step in data analysis?*

5. **Handling Missing Data**:
   - *What are some common strategies for handling missing data in a dataset, and why is it important to address missing data appropriately?*

6. **Impact of Outliers**:
   - *How can outliers impact the results of statistical analyses, and what are some ways to handle outliers in your data?*

7. **Choice of Statistical Tests**:
   - *How do you decide which statistical test (e.g., t-test, ANOVA, chi-square test) to use in a given scenario?*
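
To make the terms in questions 1 and 2 concrete, here is a small sketch of what standardization and 0/1 encoding do mechanically. It does not answer the questions; the helper names `standardize` and `dummy_encode`, the sample values, and the choice of a baseline category are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


def dummy_encode(labels, baseline):
    """Build one 0/1 indicator column per category, omitting the baseline."""
    categories = sorted(set(labels) - {baseline})
    return {cat: [1 if lab == cat else 0 for lab in labels] for cat in categories}


# Invented sample values.
prices = [10.0, 20.0, 30.0]
z = standardize(prices)
print(z)  # centered on 0 with unit spread

countries = ["US", "France", "Italy", "US"]
dummies = dummy_encode(countries, baseline="US")
print(dummies)  # rows with all zeros belong to the baseline category
```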

## Deliverable:
- Provide written responses to each question, demonstrating your understanding of these fundamental statistical concepts. Two to five sentences per question are enough; feel free to use technical terminology where needed.

## Evaluation Criteria:
- The accuracy and clarity of your explanations.
- Your ability to demonstrate an understanding of basic statistical principles and their practical applications.

This task aims to gauge your foundational knowledge in statistics, which is crucial for making informed decisions in data analysis and ensuring the validity of your results. Your responses will help us understand your capability to apply these concepts in real-world data scenarios.


In [None]:
# Add Markdown cells below and write your answers here

# Task 4: Text Analysis

## Objective:
This task aims to evaluate your skills in text analysis, which are crucial for working with AI and machine learning tools like PyTorch, TensorFlow, Hugging Face, Langchain, and others. You will be working with a text dataset to perform basic processing and analysis.

## Instructions:

1. **Dataset Selection**:
    - Choose a text dataset that is suitable for analysis. For this task, you are free to choose the text data yourself. Some popular sources include [Hugging Face Datasets](https://huggingface.co/datasets) and [Kaggle](https://www.kaggle.com/datasets).

2. **Basic Text Processing**:
    - Once you have decided on a dataset, import it in a way that we can reproduce on our side.
    - Perform basic text processing tasks such as tokenization, stop word removal, and stemming or lemmatization.
    - Feel free to create functions and briefly document them (there is no need to create a separate file; use docstrings instead).
    - Display the processed text to show the results of your text processing steps.

3. **Text Analysis**:
    - Compute the frequency distribution of words or phrases in the dataset.
    - Identify and visualize the most common words or phrases.
    - Perform sentiment analysis on a subset of reviews, if the dataset contains review or opinion texts.

4. **Insight Generation**:
    - In two or three sentences, explain why you found this particular dataset interesting.
    - Based on your text analysis, generate insights about the dataset. Discuss any interesting patterns or findings you observe.
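
The processing and frequency steps above could look roughly like the following standard-library sketch. Everything in it is an assumption made for illustration: the two sample sentences, the very short stop-word list, and the `naive_stem` helper, which is a crude suffix stripper and not the Porter stemmer or a lemmatizer. A real submission would more likely rely on an established NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "with", "this", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


# Invented sample reviews.
reviews = [
    "This wine is fruity with hints of cherry.",
    "A fruity and balanced wine, the finish lingers.",
]

print(preprocess(reviews[0]))

# Frequency distribution over all processed tokens.
counts = Counter(tok for review in reviews for tok in preprocess(review))
print(counts.most_common(3))
```

The crude stemmer produces non-words such as `balanc`, which is typical of stemming and one reason lemmatization is sometimes preferred.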

## Deliverable:
- A section in the Jupyter notebook containing all executed code for the text processing and analysis tasks, along with comments and interpretations of the results.

## Evaluation Criteria:
- Correct implementation of text processing and analysis techniques.
- Depth and relevance of insights derived from the text analysis.
- Clarity and organization of code, comments, and interpretations.

Through this task, you will demonstrate your ability to handle and analyze text data, which is a foundational skill for working with advanced AI and machine learning technologies. Your proficiency in extracting meaningful information from text will be crucial for the innovative projects at KPI.com.


In [None]:
# Add cells below and write your code here

# End of Assessment Notes

Thank you for participating in this assessment. As you conclude, please keep in mind the following key points:

## Clarity and Readability of Code:
- Your code should be clear and easy to understand. This includes proper indentation, meaningful variable names, and thoughtful organization of your code.
- Clear and readable code is crucial for effective teamwork and maintainability of projects.

## Time Allocation:
- You are expected to spend approximately 3-4 hours on this assessment. This timeframe is designed to balance thoroughness with efficiency, reflecting real-world work scenarios.

## Use of Markdown for Written Answers:
- Please provide your written answers in Markdown format. This helps in maintaining a consistent and readable format for your responses.

## Commentary and Documentation:
- Feel free to add comments throughout your code. This practice is highly encouraged as it provides insight into your thought process and makes it easier for us to understand your approach.
- Effective commenting and documentation are key skills in a collaborative and professional environment.

## Focus on Statistical Understanding:
- Although this is a position for a Data Analyst/Scientist Intern, a strong foundation in statistics is essential. You will be expected to learn and apply machine learning concepts during your internship.
- We are looking for candidates who demonstrate a rigorous understanding of statistics, as this is fundamental to success in machine learning and advanced data analysis.

## Final Thoughts:
- This assessment is an opportunity for you to showcase your technical skills, problem-solving abilities, and your approach to learning and collaboration.
- We appreciate the effort and time you put into this assessment and are excited to see your work.

Best of luck, and we look forward to reviewing your submission!


# END OF THE ASSESSMENT
