In [None]:
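import os

from mistralai import Mistral

# Read the key from the environment instead of hard-coding it in the notebook.
# Assumes the key has been exported as MISTRAL_API_KEY.
api_key = os.environ["MISTRAL_API_KEY"]
model = "mistral-small-latest"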


prompt = """
You are an experienced data scientist specializing in data cleaning for large datasets. Below is a sample of a dataset I'm working with:

symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.8,192.7,71.4,55.7,2954,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920
1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,192.7,71.4,55.9,3086,ohc,five,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875

The dataset may have the following issues:
- **Empty or Missing Values:** Some rows have missing values (denoted by “?”). Please identify and count these rows.
- **Duplicate Rows:** Identify if any rows are exact duplicates.
- **Inconsistent Data Formats:** Although there are no date columns, check for inconsistent formats in numerical or categorical data.
- **Outliers and Other Irregularities:** Analyze numerical columns (e.g., price, horsepower) to detect and quantify outliers or anomalous values. Provide specific counts or ranges.

Your task is to:
1. Analyze the dataset and list every potential data quality issue in detail.
2. For each issue, provide:
   - A clear description.
   - The count or proportion of affected rows.
   - Detailed, specific recommendations on the best way to address the issue (for example, whether to remove, impute, or flag the problematic data).
3. Do not include any code; focus solely on analysis and recommendations.
4. Present your response in clear markdown with headings, bullet points, and structured sections.
5. Additionally, explain your reasoning step by step before summarizing your final recommendations.

Please produce a comprehensive data cleaning report as a data scientist would.


"""



client = Mistral(api_key=api_key)

# Send the prompt as a single user message and print the model's report.
chat_response = client.chat.complete(
    model=model,
    messages=[
        {
            "role": "user",
            "content": prompt,
        },
    ],
    temperature=0.7,
)
print(chat_response.choices[0].message.content)
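
The prompt asks the model to count rows with missing values and exact duplicates, so a deterministic baseline is useful for checking its answer. The sketch below is one possible way to compute those two counts for the nine sample rows using only the standard library. `SAMPLE` and `quality_summary` are illustrative names that are not part of the original notebook, and the missing-value marker `?` is taken from the prompt.

```python
import csv
import io

# Header and rows copied verbatim from the sample embedded in the prompt.
SAMPLE = """symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.8,192.7,71.4,55.7,2954,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920
1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,192.7,71.4,55.9,3086,ohc,five,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
"""


def quality_summary(csv_text, missing_marker="?"):
    """Count missing-value markers and exact duplicate rows in CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    missing_by_column = {}
    rows_with_missing = 0
    for row in rows:
        row_has_missing = False
        for column, value in row.items():
            if value == missing_marker:
                missing_by_column[column] = missing_by_column.get(column, 0) + 1
                row_has_missing = True
        if row_has_missing:
            rows_with_missing += 1

    # A row is a duplicate if an identical tuple of values was seen earlier.
    seen = set()
    duplicate_rows = 0
    for row in rows:
        key = tuple(row.values())
        if key in seen:
            duplicate_rows += 1
        seen.add(key)

    return {
        "rows": len(rows),
        "rows_with_missing": rows_with_missing,
        "missing_by_column": missing_by_column,
        "duplicate_rows": duplicate_rows,
    }


summary = quality_summary(SAMPLE)
print(summary)
```

The resulting dictionary can be compared with the counts in the model's report. Any disagreement is a signal to re-read the report before acting on its recommendations.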

# Data Cleaning Report

## 1. Introduction
This report provides a comprehensive analysis of potential data quality issues present in the dataset. The dataset contains information about various car attributes including symbolic representation, normalized losses, make, fuel type, aspiration, number of doors, body style, drive wheels, engine location, wheel base, length, width, height, curb weight, engine type, number of cylinders, engine size, fuel system, bore, stroke, compression ratio, horsepower, peak RPM, city MPG, highway MPG, and price.

## 2. Data Quality Issues

### 2.1 Missing Values
- **Description:** The dataset contains rows where the `normalized-losses` column has missing values indicated by `?`.
- **Count:** There are 5 rows with missing values in the `normalized-losses` column.
- **Recommendations:** There are several ways to handle missing values in the `normalized-losses` column:
  - **Imputation:** Replace missing values with an appropriate statistic (mean, median, or 