### Columns typically found in the Titanic dataset

1. **PassengerId**: A unique identifier for each passenger.
2. **Survived**: Binary indicator of survival (0 = No, 1 = Yes).
3. **Pclass**: Passenger class (1 = First, 2 = Second, 3 = Third).
4. **Name**: Name of the passenger.
5. **Sex**: Gender of the passenger (male/female).
6. **Age**: Age of the passenger in years.
7. **SibSp**: Number of siblings/spouses aboard the Titanic.
8. **Parch**: Number of parents/children aboard the Titanic.
9. **Ticket**: Ticket number.
10. **Fare**: Passenger fare (monetary amount).
11. **Cabin**: Cabin number.
12. **Embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

### Typical Data Characteristics:
- **Survived**: A categorical column often used to analyze survival rates based on different features.
- **Pclass**: An ordinal feature that acts as a proxy for socioeconomic status and is correlated with survival.
- **Age**: Often contains missing values; crucial for analysis related to age demographics.
- **Sex**: Commonly analyzed to see survival rate differences between genders.
- **Fare**: Can show a wide range, reflecting different ticket prices; often analyzed in relation to passenger class and survival.
- **Embarked**: Often used to explore geographic and socio-economic differences among passengers.
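
The remark that **Age** often contains missing values is easy to check per column. A minimal sketch, assuming a small made-up frame (`toy` and its values are illustrative, not rows from the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real Titanic frame
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, 7.25, 8.05],
    "cabin": ["B5", None, None, None],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_counts = toy.isna().sum()
print(missing_counts)
```

Run against the real frame, the same call would show which columns need imputation or filtering before analysis.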

Note that the file loaded below uses lowercase column names (`pclass`, `survived`, `fare`, and so on) and has no `PassengerId` column, so the code cells refer to the lowercase names.

In [44]:
import pandas as pd
url = 'https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/titanic.csv'
df = pd.read_csv(url)

### Find the maximum and minimum fare paid by passengers.

In [45]:
max_fare = df['fare'].max()
min_fare = df['fare'].min()

print('Max fare: ', max_fare)

Max fare:  512.3292
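
The cell above prints only the maximum. A minimal sketch that reports both ends of the range in one call, on made-up values (the `fares` series is illustrative, not the real column):

```python
import pandas as pd

# Illustrative fares, not taken from the dataset
fares = pd.Series([7.25, 0.0, 512.3292, 26.55], name="fare")

# agg with a list of function names returns a Series indexed by those names
fare_range = fares.agg(["min", "max"])

print("Min fare:", fare_range["min"])
print("Max fare:", fare_range["max"])
```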


### Identify the passengers who paid more than 100 in fare.

In [46]:
high_fare_passengers = df[df['fare'] > 100]

print('High fare passengers: ', high_fare_passengers)

High fare passengers:       pclass                                             name     sex      age  \
0         1                    Allen, Miss. Elisabeth Walton  female  29.0000   
1         1                   Allison, Master. Hudson Trevor    male   0.9167   
2         1                     Allison, Miss. Helen Loraine  female   2.0000   
3         1             Allison, Mr. Hudson Joshua Creighton    male  30.0000   
4         1  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0000   
..      ...                                              ...     ...      ...   
312       1                       Widener, Mr. George Dunton    male  50.0000   
313       1                        Widener, Mr. Harry Elkins    male  27.0000   
314       1     Widener, Mrs. George Dunton (Eleanor Elkins)  female  50.0000   
319       1                        Wilson, Miss. Helen Alice  female  31.0000   
322       1                         Young, Miss. Marie Grice  female  36.0000   

    

### Find the median age of passengers.

In [47]:
median_age = df['age'].median()

print('Median age: ', median_age)

Median age:  28.0


### Determine the number of passengers who traveled alone (no siblings/spouses and no parents/children).

In [48]:
solo_travelers = df[(df['sibsp'] == 0) & (df['parch'] == 0)]
num_solo_travelers = solo_travelers.shape[0]

print('Number of solo travelers: ', num_solo_travelers)

Number of solo travelers:  790
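
An equivalent way to express "traveled alone" is a family size of zero. A minimal sketch on made-up rows (`toy`, `family_size`, and `is_solo` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical rows: only the first and third passengers have no relatives aboard
toy = pd.DataFrame({"sibsp": [0, 1, 0, 0], "parch": [0, 2, 0, 1]})

# Both counts are non-negative, so their sum is zero only when both are zero
family_size = toy["sibsp"] + toy["parch"]
is_solo = family_size == 0

num_solo = int(is_solo.sum())   # number of solo travelers
share_solo = is_solo.mean()     # fraction of all passengers

print("Number of solo travelers:", num_solo)
print("Share of solo travelers:", share_solo)
```

Keeping the boolean mask around makes it simple to reuse the flag later, for example as a grouping key.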


### Calculate the average fare paid by passengers in each embarkation point.

In [49]:
avg_fare_by_embarkation = df.groupby('embarked')['fare'].mean()

print('Average fare by embarkation: ', avg_fare_by_embarkation)

Average fare by embarkation:  embarked
C    62.336267
Q    12.409012
S    27.418824
Name: fare, dtype: float64


### Find the correlation between 'Pclass' and 'Survived'.

In [50]:
pclass_survived_corr = df['pclass'].corr(df['survived'])

print('Correlation between pclass and survived: ', pclass_survived_corr)

Correlation between pclass and survived:  -0.31246936264967606
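
`Series.corr` computes the Pearson coefficient by default, and a negative value here means that a higher class number (third class) goes with lower survival. A minimal sketch that reproduces the coefficient by hand on made-up data (`toy` is illustrative, not the real frame):

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: survival is more common in the lower class numbers
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 1],
})

# Pearson correlation as computed by pandas
r_pandas = toy["pclass"].corr(toy["survived"])

# Same quantity from the definition: covariance over the product of spreads
x = toy["pclass"] - toy["pclass"].mean()
y = toy["survived"] - toy["survived"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```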


### Identify passengers whose age is above 70.

In [51]:
elderly_passengers = df[df['age'] > 70]

print('Elderly passengers: ', elderly_passengers)

Elderly passengers:        pclass                                               name     sex   age  \
9          1                            Artagaveytia, Mr. Ramon    male  71.0   
14         1               Barkworth, Mr. Algernon Henry Wilson    male  80.0   
61         1  Cavendish, Mrs. Tyrell William (Julia Florence...  female  76.0   
135        1                          Goldschmidt, Mr. George B    male  71.0   
727        3                               Connors, Mr. Patrick    male  70.5   
1235       3                                Svensson, Mr. Johan    male  74.0   

      sibsp  parch    ticket     fare cabin embarked  survived  
9         0      0  PC 17609  49.5042   NaN        C         0  
14        0      0     27042  30.0000   A23        S         1  
61        1      0     19877  78.8500   C46        S         1  
135       0      0  PC 17754  34.6542    A5        C         0  
727       0      0    370369   7.7500   NaN        Q         0  
1235      0      0   

### Count the number of passengers for each unique 'Ticket' value.

In [52]:
ticket_counts = df['ticket'].value_counts()

print('Ticket counts: ', ticket_counts)

Ticket counts:  CA. 2343    11
1601         8
CA 2144      8
PC 17608     7
347077       7
            ..
373450       1
2223         1
350046       1
3101281      1
315082       1
Name: ticket, Length: 929, dtype: int64
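
Counts above 1 indicate tickets shared by several passengers. A minimal sketch that keeps only those, using a short made-up series (the ticket strings are borrowed from the output above, but the counts here are illustrative):

```python
import pandas as pd

# Illustrative ticket column with two shared tickets and one single ticket
tickets = pd.Series(
    ["CA. 2343", "CA. 2343", "1601", "373450", "1601", "CA. 2343"],
    name="ticket",
)

ticket_counts = tickets.value_counts()

# Boolean indexing keeps only tickets held by more than one passenger
shared_tickets = ticket_counts[ticket_counts > 1]
print(shared_tickets)
```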


### Calculate the average 'Fare' for male and female passengers.

In [53]:
avg_fare_by_sex = df.groupby('sex')['fare'].mean()

avg_fare_by_sex 

sex
female    46.198097
male      26.154601
Name: fare, dtype: float64

### Extract passengers with names starting with a particular letter, e.g., 'A'.

In [54]:
passengers_with_A = df[df['name'].str.startswith('A')]

passengers_with_A 

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,survived
0,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,1
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,1
2,1,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,0
3,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,0
4,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,0
...,...,...,...,...,...,...,...,...,...,...,...
649,3,"Assam, Mr. Ali",male,23.0000,0,0,SOTON/O.Q. 3101309,7.0500,,S,0
650,3,"Attalah, Miss. Malake",female,17.0000,0,0,2627,14.4583,,C,0
651,3,"Attalah, Mr. Sleiman",male,30.0000,0,0,2694,7.2250,,C,0
652,3,"Augustsson, Mr. Albert",male,23.0000,0,0,347468,7.8542,,S,0


### Find the standard deviation of the 'Fare' column.

In [55]:
fare_std_dev = df['fare'].std()

fare_std_dev 

51.758668239174135

### Create a column 'FareCategory' that categorizes fares into 'Low', 'Medium', and 'High'.

In [56]:
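def categorize_fare(fare):
    # NaN fails both comparisons below and would otherwise be labelled 'High'
    if pd.isna(fare):
        return None
    if fare < 50:
        return 'Low'
    elif fare < 100:
        return 'Medium'
    else:
        return 'High'

df['FareCategory'] = df['fare'].apply(categorize_fare)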
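
The same binning can be expressed without a Python-level function by using `pd.cut`. A minimal sketch on made-up fares, assuming the same thresholds of 50 and 100 (the `fares` series is illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative fares, including the two boundary values and a missing entry
fares = pd.Series([7.25, 50.0, 99.99, 100.0, np.nan], name="fare")

# right=False makes each bin closed on the left: [-inf, 50), [50, 100), [100, inf)
fare_category = pd.cut(
    fares,
    bins=[-np.inf, 50, 100, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)
print(fare_category)
```

With `pd.cut`, a missing fare stays missing instead of being assigned to a bin, and the result is an ordered categorical, which sorts as Low < Medium < High.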

In [57]:
df

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,survived,FareCategory
0,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,1,High
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,1,High
2,1,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,0,High
3,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,0,High
4,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,0,High
...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,0,Low
1305,3,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,0,Low
1306,3,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,0,Low
1307,3,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,0,Low


### Determine the number of unique cabin numbers.

In [58]:
unique_cabins = df['cabin'].nunique()
unique_cabins 

186

### Count the number of passengers who have a cabin number assigned.

In [59]:
passengers_with_cabin = df['cabin'].notnull().sum()
passengers_with_cabin 

295

### Find the average age of survivors and non-survivors.

In [60]:
avg_age_by_survival = df.groupby('survived')['age'].mean()
avg_age_by_survival

survived
0    30.545369
1    28.918228
Name: age, dtype: float64

### Calculate the total fare collected from all passengers.

In [61]:
total_fare_collected = df['fare'].sum()
total_fare_collected

43550.4869

### Find the proportion of passengers in each class who survived.

In [62]:
survival_rate_by_class = df.groupby('pclass')['survived'].mean()
survival_rate_by_class

pclass
1    0.619195
2    0.429603
3    0.255289
Name: survived, dtype: float64
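
A rate is easier to interpret next to the group sizes behind it. A minimal sketch using named aggregation on made-up rows (`toy` and the output column names are illustrative):

```python
import pandas as pd

# Hypothetical passengers, not rows from the real file
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# For a 0/1 column: sum = survivors, count = passengers, mean = survival rate
summary = toy.groupby("pclass")["survived"].agg(
    survivors="sum",
    passengers="count",
    rate="mean",
)
print(summary)
```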

### Identify the top 5 passengers who paid the highest fare.

In [63]:
top_fare_payers = df.nlargest(5, 'fare')
top_fare_payers

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,survived,FareCategory
49,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C,1,High
50,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0,1,PC 17755,512.3292,B51 B53 B55,C,1,High
183,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C,1,High
302,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C,1,High
111,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0,C23 C25 C27,S,1,High


### Find the most common first letter of the passenger names.

In [64]:
most_common_initial = df['name'].str[0].mode()[0]
most_common_initial

'S'
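
Because names are stored as "Surname, Title. Given names", the first character is the surname initial. A minimal sketch that shows the full distribution of initials rather than only the mode (two of the three names below are made up for illustration):

```python
import pandas as pd

# Illustrative names in the dataset's "Surname, Title. Given names" layout
names = pd.Series([
    "Smith, Mr. John",                 # made-up
    "Sage, Miss. Ada",                 # made-up
    "Allen, Miss. Elisabeth Walton",
])

# str[0] takes the first character; value_counts tallies each initial
initial_counts = names.str[0].value_counts()
most_common = initial_counts.idxmax()

print(initial_counts)
print(most_common)
```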

### Calculate the average number of siblings/spouses aboard for each passenger class.

In [65]:
avg_sibsp_by_class = df.groupby('pclass')['sibsp'].mean()
avg_sibsp_by_class

pclass
1    0.436533
2    0.393502
3    0.568406
Name: sibsp, dtype: float64

In [66]:
import pandas as pd
from pandasai import SmartDataframe
from pandasai.prompts.pandasai import get_prompt

# Load the dataset
url = 'https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/titanic.csv'
df = pd.read_csv(url)

# Convert to SmartDataframe
smart_df = SmartDataframe(df)

# Example question and answer
question = "What is the average fare paid by passengers in each embarkation point?"
answer = smart_df.ask(question)

print(answer)


ModuleNotFoundError: No module named 'pandasai.prompts.pandasai'

In [71]:
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

True

In [81]:
import os
import pandas as pd
from langchain_openai import OpenAI
from pandasai import SmartDataframe

llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [82]:
import os
import pandas as pd
from langchain_openai import OpenAI
from pandasai import SmartDataframe

# Ensure the OpenAI API key is set in your environment variables
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable is not set.")

# Build the LLM only after the key check so this cell does not depend on In [81]
llm = OpenAI(api_key=api_key)

# Load the Titanic dataset
url = 'https://raw.githubusercontent.com/rajeevratan84/datascienceforbusiness/master/titanic.csv'
df = pd.read_csv(url)

# Convert to SmartDataframe
smart_df = SmartDataframe(df, config={"llm": llm})

# Example questions
questions = [
    "What is the average fare paid by passengers?",
    # "How many passengers survived?",
    # "What is the correlation between age and fare?",
    # "Which passengers paid more than 100 in fare?",
    # "What is the median age of passengers?",
]

# Get answers for each question
for question in questions:
    print(f"Question: {question}")
    answer = smart_df.chat(question)
    print(f"Answer: {answer}\n")


Question: What is the average fare paid by passengers?
Answer: Unfortunately, I was not able to answer your question, because of the following error:

Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-*********************************************HAH1. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}




Traceback (most recent call last):
  File "/Users/tuchsanai/anaconda3/envs/torch/lib/python3.10/site-packages/pandasai/pipelines/chat/generate_chat_pipeline.py", line 335, in run
    ).run(input)
  File "/Users/tuchsanai/anaconda3/envs/torch/lib/python3.10/site-packages/pandasai/pipelines/pipeline.py", line 137, in run
    raise e
  File "/Users/tuchsanai/anaconda3/envs/torch/lib/python3.10/site-packages/pandasai/pipelines/pipeline.py", line 101, in run
    step_output = logic.execute(
  File "/Users/tuchsanai/anaconda3/envs/torch/lib/python3.10/site-packages/pandasai/pipelines/chat/code_generator.py", line 33, in execute
    code = pipeline_context.config.llm.generate_code(input, pipeline_context)
  File "/Users/tuchsanai/anaconda3/envs/torch/lib/python3.10/site-packages/pandasai/llm/base.py", line 201, in generate_code
    response = self.call(instruction, context)
  File "/Users/tuchsanai/anaconda3/envs/torch/lib/python3.10/site-packages/pandasai/llm/langchain.py", line 55, in call
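
The 401 response means the key was read but rejected by the API, which a presence check alone cannot detect. A minimal sketch of a helper that at least rules out an empty or whitespace-padded value and prints the key in masked form for comparison with the dashboard. `get_api_key`, `mask`, and the demo value are hypothetical, and only the API itself can confirm that a key is valid:

```python
import os


def get_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key from `env` with surrounding whitespace removed, or raise."""
    key = (env.get(name) or "").strip()
    if not key:
        raise ValueError(f"{name} is not set or is empty.")
    return key


def mask(key, visible=4):
    """Hide all but the last `visible` characters so the key can be printed safely."""
    return "*" * max(len(key) - visible, 0) + key[-visible:]


# Demo with a made-up value; a trailing newline is a common copy-paste artifact
demo_env = {"OPENAI_API_KEY": "  sk-example-1234\n"}
demo_key = get_api_key(demo_env)
print(mask(demo_key))
```

If the masked suffix does not match the key shown in the provider's dashboard, the `.env` file or shell environment is likely supplying a stale value.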
