In [1]:
!pip install --quiet openai pandas

In [17]:
# Dependencies are installed in the cell above

# Mount Google Drive (uncomment when running in Colab)
#from google.colab import drive
import json

import numpy as np
import openai
import pandas as pd

#drive.mount('/content/drive')

# 🔁 Set your CSV file path here
#csv_path = '/content/drive/MyDrive/your_folder/your_file.csv'
csv_path = '/content/test.csv'
df = pd.read_csv(csv_path)

# ✅ Sample up to 100 rows evenly spaced throughout the dataset
sample_size = min(100, len(df))
indices = np.linspace(0, len(df) - 1, sample_size, dtype=int)
sample_df = df.iloc[indices]

# ✅ Convert sample to CSV string
sample_csv = sample_df.to_csv(index=False)

# ✅ Calculate correlation matrix for numeric columns
corr_matrix = df.corr(numeric_only=True).round(2).to_string()

# ✅ Get column data types
column_types = {col: str(dtype) for col, dtype in df.dtypes.items()}
column_types_str = json.dumps(column_types, indent=2)

# ✅ Construct the GPT prompt
prompt1 = f"""
You are a data analyst.

You are given a representative sample of a dataset. This sample has been taken evenly across the full dataset to preserve its structure and trends.

Below is the sample ({len(sample_df)} rows):
{sample_csv}

Below is the correlation matrix of numerical columns:
{corr_matrix}

And here are the column data types:
{column_types_str}

Please do the following:
1. Provide a short and descriptive **title** summarizing the overall theme of the dataset.
2. Write a 2–3 sentence **overview** of the dataset structure and key features.
3. Give a bullet list of **6–7 trends, patterns, correlations, or anomalies** based on this data. Focus on numeric variation, relationships, missing data, value distributions, or unusual behaviors.
"""

#----------------------
# Prompts 2-4 embed the full dataset, so a large CSV can exceed the model's context window
csv_text = df.to_csv(index=False)
prompt2 = f"""
You are a professional data analyst.

Here is the full dataset in CSV format:
{csv_text}

Please do the following:
1. Provide a clear and descriptive **title** for the dataset.
2. Write a **2–3 sentence overview** describing what this dataset is about and how it's structured.
3. Identify and explain **6–7 key trends, correlations, or patterns** based on this data. You may comment on:
   - Common value distributions
   - Relationships between variables
   - Outliers or anomalies
   - Columns with high or low variability
   - Anything unusual or interesting

Please write clearly and professionally, as if explaining your findings to a colleague who hasn’t seen the data.
"""

prompt3 =  f"""
You are a data analyst tasked with writing a statistical summary report based on the following dataset:

{csv_text}

Write the report using the following structure:
1. Begin with a concise, fact-driven paragraph that summarizes major numerical trends in the data. The tone should be formal, neutral, and impersonal — like a government or UN report. Avoid phrases like "this dataset" or "we found".
2. After the summary paragraph, provide a bullet list of 6–7 key observations. Each bullet should present one notable statistic, regional variation, or change pattern in a factual and detached tone.
3. Use numbers, units, and comparisons (e.g. month-on-month change, percentage differences, regional rankings) whenever possible.
4. Avoid any personal or analytical language — present the information as statistical facts.

The goal is to present a data-driven summary similar to a humanitarian bulletin or economic trends report.
"""


prompt4 = f"""
You are a professional statistical reporter. You are tasked with writing a formal, factual report using quantitative information extracted from the following data:

{csv_text}

Write a concise report with the following structure:

1. Start with a paragraph that presents the main numerical trends, facts, and comparisons. Do not refer to the dataset or say that data was analyzed — present the information as real-world observations or statistics.
2. Follow this with a bullet list of 6–7 specific observations. Each bullet should report a number, ranking, pattern, or change in a neutral, factual tone.
3. Use percentages, ranks, comparisons, and figures where appropriate.
4. Do not use words like "the dataset", "data shows", "analysis reveals", or "according to the data".
5. Present it as if this information is being directly communicated to stakeholders or the public.

Your tone should match that of official UN or economic trend bulletins — factual, impersonal, and clear.
"""


# ✅ Create the OpenAI client
# 🔐 Never hardcode the key in a notebook; OpenAI() reads it from the OPENAI_API_KEY environment variable
client = openai.OpenAI()

# ✅ Query the OpenAI Chat API
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",  # Or another chat model available to your account
    messages=[
        {"role": "system", "content": "You are a helpful data analyst."},
        {"role": "user", "content": prompt4}
    ],
    temperature=0.5,
    max_tokens=1000
)

# ✅ Print GPT's response
print("\n🧠 GPT Analysis:")
print(response.choices[0].message.content.strip())



🧠 GPT Analysis:
**Statistical Summary Report: Syrian Arab Republic Market Survey Data**

**Summary:**
The dataset comprises market survey data from various regions in the Syrian Arab Republic, focusing on factors such as MFIRiskLevel, MFIScore, and availability of goods. The average MFIScore across all regions is 7.07, indicating a moderate level of market functionality. Regions exhibit varying levels of risk, with some areas classified as Very High Risk and others as Low Risk. The average availability of goods is 6.94, suggesting a relatively stable supply chain in most regions.

**Key Observations:**
- The region of Homs shows the highest average MFIScore of 8.01, indicating a well-functioning market compared to other regions.
- Rural Damascus has the highest average availability score of 8.33, indicating a consistent supply of goods in that area.
- Market functionality in Aleppo is notably diverse, with some sub-regions scoring as low as 4.0 while others score a perfect 10.0.
- Reg
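The analysis above stops mid-bullet. Whether that came from the `max_tokens=1000` cap or from the notebook export is not visible here. One way to tell the two apart is to inspect `finish_reason` on the first choice of the response. The sketch below is a minimal illustration; `describe_finish` is a hypothetical helper name, and a stand-in object replaces the real `response` so that it runs without an API call.

```python
from types import SimpleNamespace


def describe_finish(response):
    """Return a short note on why generation stopped for the first choice."""
    reason = response.choices[0].finish_reason
    if reason == "length":
        # The model ran into the max_tokens limit before finishing.
        return "truncated: hit max_tokens; raise the limit or shorten the prompt"
    if reason == "stop":
        return "complete"
    return f"stopped for another reason: {reason}"


# Stand-in with the same attribute shape as a chat completion (assumption:
# only choices[0].finish_reason is needed for this check).
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(describe_finish(fake))
```

In the notebook itself, calling `describe_finish(response)` right after the `client.chat.completions.create(...)` call would report whether the printed text is complete.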

In [9]:
models = client.models.list()
print([m.id for m in models.data])

['dall-e-3', 'dall-e-2', 'gpt-4o-audio-preview-2024-10-01', 'text-embedding-3-small', 'gpt-4.1-nano', 'gpt-4.1-nano-2025-04-14', 'babbage-002', 'text-embedding-ada-002', 'text-embedding-3-large', 'gpt-4o-mini-audio-preview', 'gpt-4o-audio-preview', 'o1-preview-2024-09-12', 'gpt-4.1-mini', 'gpt-3.5-turbo-instruct-0914', 'gpt-4o-mini-search-preview', 'gpt-4.1-mini-2025-04-14', 'davinci-002', 'gpt-3.5-turbo-1106', 'gpt-3.5-turbo-instruct', 'gpt-3.5-turbo', 'gpt-4o-mini-search-preview-2025-03-11', 'gpt-4o-2024-11-20', 'whisper-1', 'gpt-4o-2024-05-13', 'gpt-3.5-turbo-16k', 'o1-preview', 'gpt-4o-search-preview', 'gpt-4.5-preview', 'gpt-4.5-preview-2025-02-27', 'gpt-4o-search-preview-2025-03-11', 'omni-moderation-2024-09-26', 'tts-1-hd', 'gpt-4o', 'tts-1-hd-1106', 'gpt-4o-mini', 'gpt-4o-2024-08-06', 'gpt-4.1', 'gpt-4o-transcribe', 'gpt-4.1-2025-04-14', 'gpt-4o-mini-2024-07-18', 'gpt-4o-mini-transcribe', 'o1-mini', 'gpt-4o-mini-audio-preview-2024-12-17', 'gpt-3.5-turbo-0125', 'o1-mini-2024-09-
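The list above mixes chat models with image, audio, embedding and moderation models. A rough way to narrow it down is to filter by name. The sketch below is a heuristic only: `chat_model_ids` is a hypothetical helper, and the prefix and excluded keywords are assumptions drawn from the ids printed above, not an official classification.

```python
def chat_model_ids(model_ids):
    """Keep ids that look like general chat models (heuristic, by name only)."""
    excluded = ("audio", "transcribe", "search", "instruct")
    return sorted(
        m for m in model_ids
        if m.startswith(("gpt-", "o1")) and not any(word in m for word in excluded)
    )


# A few ids copied from the output above.
ids = [
    "dall-e-3", "gpt-4o-audio-preview", "gpt-4.1-mini", "whisper-1",
    "gpt-3.5-turbo-instruct", "gpt-3.5-turbo-0125", "gpt-4o-mini-transcribe",
]
print(chat_model_ids(ids))
```

With the live client, the same helper could be applied to `[m.id for m in models.data]`.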

🧠 GPT Analysis:
1. **Title**: Analysis of Market Resilience and Service Quality in Syrian Arab Republic

2. **Overview**: The dataset contains information on market assessments in various regions of Syrian Arab Republic, including market resilience, service quality, assortment, availability, price, competition, and infrastructure. The dataset also includes risk levels associated with each market and survey details.

3. **Trends and Patterns**:
   - There is a strong positive correlation between MFIScore and Infrastructure (0.64) and Quality (0.56), indicating that markets with higher MFIScores tend to have better infrastructure and service quality.
   - Availability and Price have a moderate positive correlation of 0.44, suggesting that as availability increases, prices also tend to increase.
   - Markets with higher Resilience scores tend to have better Infrastructure and Quality, as indicated by the positive correlations of 0.36 and 0.22 respectively.
   - There is a negative correlation between Competition and Access Protection (-0.10), implying that markets with higher competition may have lower access protection measures.
   - The dataset contains missing data in columns like Competition, Infrastructure, Service, Quality, and AccessProtection, indicating potential data collection issues or incomplete survey responses.
   - Markets in the sample vary in risk levels from Low to Very High Risk, with some markets transitioning from Full service to Reduced service over time, indicating changes in market conditions.
   - The SurveyDate column can be further analyzed to identify trends over time in market assessments and risk levels.
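
The analysis above quotes specific correlations and names columns with missing data. Because model output can misstate numbers, it may be worth recomputing those figures locally before relying on them. The sketch below shows one way to do that with pandas; `strongest_correlations` and `missing_counts` are hypothetical helper names, and the toy frame stands in for `df`.

```python
from itertools import combinations

import pandas as pd


def strongest_correlations(df, top_n=5):
    """Return the top_n numeric column pairs ranked by absolute Pearson r."""
    numeric = df.select_dtypes(include="number")
    pairs = []
    for a, b in combinations(numeric.columns, 2):
        r = numeric[a].corr(numeric[b])  # pairwise, ignores rows with NaN
        if pd.notna(r):
            pairs.append((a, b, round(float(r), 2)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:top_n]


def missing_counts(df):
    """Return {column: number of missing values} for columns with any gaps."""
    counts = df.isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "w": [1.0, None, 2.0, 5.0],
})
print(strongest_correlations(toy, top_n=1))
print(missing_counts(toy))
```

Running both helpers on `df` gives values to compare against the figures in the generated bullets.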