# LangChain basics with football stats in Italy

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rDqka8UCUheFtqWHH6jQwIgyzMYuA_sI?usp=sharing)

In [1]:
# Install the required packages
!pip install -qU langchain==0.1.5 langchain-community==0.0.17 langchain-core==0.1.18 langchain-openai==0.0.5 openai==1.11.0 langchain_experimental tiktoken==0.5.2 chromadb==0.4.22 pandas==2.2.0

In [2]:
!pip freeze | grep langchain

langchain==0.1.5
langchain-community==0.0.17
langchain-core==0.1.18
langchain-experimental==0.0.51
langchain-openai==0.0.5


## Data Download
First, we will fetch the Serie A 2022/23 league table from the Sky Sports website. To work with our own data, we will save the table as a CSV file, which can later be loaded with the `CSVLoader`. In principle, one could use `WebCrawler` instead of a custom function, or wrap our function in a tool.

In [3]:
import requests
import bs4
import pandas as pd
import pprint

In [4]:
from google.colab import userdata
import os

In [5]:
# Optional fallback: paste the HTML of the standings table here in case the request gets blocked
html = """

"""

In [6]:
from io import StringIO

# Function to make an HTTP GET request with a browser-like user agent
def get_request(url, cookies=None, headers=None):
    default_headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    }
    # Caller-supplied headers override the defaults instead of being discarded
    merged_headers = {**default_headers, **(headers or {})}
    return requests.get(url, cookies=cookies or {}, headers=merged_headers, timeout=30)

# Function to collect data from a URL and extract the standings table
def collect_data(url):
    response = get_request(url)
    soup = bs4.BeautifulSoup(response.text, features="lxml")
    table = soup.find("table", class_="standing-table__table")
    # Fall back to the pasted HTML if the request was blocked (e.g. by Cloudflare)
    if not table:
        soup = bs4.BeautifulSoup(html, features="lxml")
        table = soup.find("table", class_="standing-table__table")
    return table

# Function to convert an HTML table to a pandas DataFrame
def convert_html_table_to_df(html_text):
    # Wrap the markup in StringIO: passing literal HTML to read_html is deprecated in recent pandas
    return pd.read_html(StringIO(str(html_text)))[0]

In [7]:
# URL of the Sky Sports page containing the Serie A 2022/23 table
ROOT_URL = "https://www.skysports.com/serie-a-table/2022"

# Collect the data and convert it to a DataFrame
html_text = collect_data(url=ROOT_URL)
df = convert_html_table_to_df(html_text=html_text)



- Pl (Played): The number of matches the team has played so far in the season.
- W (Wins): The number of matches the team has won.
- D (Draws): The number of matches that have ended in a draw (neither team won).
- L (Losses): The number of matches the team has lost.
- F (Goals For): The total number of goals the team has scored.
- A (Goals Against): The total number of goals the team has conceded.
- GD (Goal Difference): The difference between goals scored (F) and goals conceded (A), calculated as Goals For minus Goals Against (F - A). This is often used as a tiebreaker in rankings.
- Pts (Points): The total number of points the team has earned. Typically, teams earn 3 points for a win, 1 point for a draw, and 0 points for a loss.
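
The column definitions above imply two relationships that are easy to check: GD = F - A and, under the usual 3-1-0 scoring, Pts = 3·W + D. Below is a minimal sketch in plain Python, using a few rows copied from the table in this notebook. The `rows` list and the `check_row` helper are illustrative names, not part of the notebook itself.

```python
# A few rows copied from the Serie A 2022/23 table used in this notebook:
# (team, W, D, L, F, A, GD, Pts)
rows = [
    ("Napoli", 28, 6, 4, 77, 28, 49, 90),
    ("Lazio", 22, 8, 8, 60, 30, 30, 74),
    ("Inter Milan", 23, 3, 12, 71, 42, 29, 72),
    ("Juventus *", 22, 6, 10, 56, 33, 23, 62),
]


def check_row(row):
    """Return (gd_ok, pts_ok) for one table row."""
    team, w, d, l, f, a, gd, pts = row
    gd_ok = (f - a) == gd          # goal difference is F minus A
    pts_ok = (3 * w + d) == pts    # 3 points per win, 1 per draw
    return gd_ok, pts_ok


results = {row[0]: check_row(row) for row in rows}
for team, (gd_ok, pts_ok) in results.items():
    print(f"{team:12s} GD consistent: {gd_ok}  Pts consistent: {pts_ok}")
```

In these rows only `Juventus *` fails the points check (3·22 + 6 = 72, while the table lists 62), which suggests the asterisk marks an adjustment to the points total. This is why the column description says teams *typically* earn 3 points for a win.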

In [8]:
df.head()

Unnamed: 0,#,Team,Pl,W,D,L,F,A,GD,Pts,Last 6
0,1,Napoli,38,28,6,4,77,28,49,90,
1,2,Lazio,38,22,8,8,60,30,30,74,
2,3,Inter Milan,38,23,3,12,71,42,29,72,
3,4,AC Milan,38,20,10,8,64,43,21,70,
4,5,Atalanta,38,19,7,12,66,48,18,64,


In [9]:
# To ensure that answers come only from our data, let's add a fictional team
new_row = pd.DataFrame({'#':[42],'Team': ["LLMia"], 'Pl': [38], 'W': [24], 'D': [6], 'L': [8], 'F': [48], 'A': [20], 'GD': [28], 'Pts': [78]})

# Position to insert the new row (0-indexed, so 1 is actually the 2nd row)
position = 1

# Split the DataFrame into two parts
df1 = df.iloc[:position]
df2 = df.iloc[position:]

# Concatenate the first part, new row, and second part
df = pd.concat([df1, new_row, df2]).reset_index(drop=True)
df.drop(["#"], axis=1, inplace=True)

In [10]:
df

Unnamed: 0,Team,Pl,W,D,L,F,A,GD,Pts,Last 6
0,Napoli,38,28,6,4,77,28,49,90,
1,LLMia,38,24,6,8,48,20,28,78,
2,Lazio,38,22,8,8,60,30,30,74,
3,Inter Milan,38,23,3,12,71,42,29,72,
4,AC Milan,38,20,10,8,64,43,21,70,
5,Atalanta,38,19,7,12,66,48,18,64,
6,Roma,38,18,9,11,50,38,12,63,
7,Juventus *,38,22,6,10,56,33,23,62,
8,Fiorentina,38,15,11,12,53,43,10,56,
9,Bologna,38,14,12,12,53,49,4,54,


In [11]:
# Save the DataFrame to a CSV file
df.iloc[:-1, :].to_csv("results_22-23.csv")

## LangChain
Now we will import the LangChain components that we will be using. For the purposes of this demo, we will take a straightforward approach based on the `ChatOpenAI` model. We will create a vector index from our text data and query the league table through a pandas DataFrame agent. Additionally, we will create a simple prompt and set up a memory to store the conversation history. Finally, we will configure an agent, run by an `AgentExecutor`, to bring all these components together.

In [12]:
# Set OpenAI API key from Google Colab's user environment or default
def set_openai_api_key(default_key: str = "YOUR_API_KEY") -> None:
    """Set the OpenAI API key from Google Colab's user environment or use a default value."""

    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY") or default_key


set_openai_api_key()

In [13]:
'''
# Load language model, embeddings, and index for conversational AI
from langchain.chat_models import ChatOpenAI                #model
from langchain.indexes import VectorstoreIndexCreator       #index
from langchain.document_loaders.csv_loader import CSVLoader #tool
from langchain.prompts import PromptTemplate                #prompt
from langchain.memory import ConversationBufferMemory       #memory
from langchain.chains import RetrievalQA                    #chain

#import langchain
#langchain.verbose = True
'''



In [14]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts.chat import ChatPromptTemplate, MessagesPlaceholder
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain.agents import AgentExecutor
from langchain_core.output_parsers import StrOutputParser
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers.openai_functions import OpenAIFunctionsAgentOutputParser
from langchain.indexes import VectorstoreIndexCreator
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.memory import ConversationSummaryBufferMemory, ChatMessageHistory

from operator import itemgetter

In [15]:
model_name = "gpt-4-0125-preview"
MEMORY_KEY = "chat_history"
verbose = False

In [16]:
# Create a prompt using the template
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are Luigi. You are an assistant that helps answer football questions based only on the data provided. Don't use other data sources, only use the given tools!
            If asked any question about Serie A results - use the PandasDataFrameAgent tool!
            If asked about the history of Italian football - use the search_for_football_history tool!
            Here are the explanations for each column of the league table:
            Pl (Played): The number of matches the team has played so far in the season.
            W (Wins): The number of matches the team has won.
            D (Draws): The number of matches that have ended in a draw (neither team won).
            L (Losses): The number of matches the team has lost.
            F (Goals For): The total number of goals the team has scored.
            A (Goals Against): The total number of goals the team has conceded.
            GD (Goal Difference): The difference between goals scored (F) and goals conceded (A), calculated as Goals For minus Goals Against (F - A). This is often used as a tiebreaker in rankings.
            Pts (Points): The total number of points the team has earned. Typically, teams earn 3 points for a win, 1 point for a draw, and 0 points for a loss.
            """,
        ),
        MessagesPlaceholder(variable_name=MEMORY_KEY),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

In [17]:
# Creating memory

history = ChatMessageHistory()
history.add_user_message("Hi, my name is Ivan.")
history.add_ai_message("Hello Ivan! How can I assist you today?")

memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(model=model_name),
    return_messages=True,
    memory_key=MEMORY_KEY,
    chat_memory=history
)

In [18]:
print(memory.load_memory_variables({}))
memory.predict_new_summary(existing_summary="", messages=history.messages)

{'chat_history': [HumanMessage(content='Hi, my name is Ivan.'), AIMessage(content='Hello Ivan! How can I assist you today?')]}


'Ivan greets the AI, and the AI asks how it can assist him today.'
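
To illustrate the idea behind a summary buffer, here is a toy sketch in plain Python: recent messages are kept verbatim, and once a size budget is exceeded the oldest ones are folded into a running summary. This is not the LangChain implementation. The `ToySummaryBuffer` class and the `summarize` helper are hypothetical, the word count is only a crude stand-in for token counting, and the real class asks the LLM to write the summary (its budget is, to our understanding, controlled by `max_token_limit`).

```python
def summarize(previous_summary, messages):
    """Hypothetical stand-in for the LLM call that condenses old messages."""
    parts = [previous_summary] if previous_summary else []
    parts += [f"{role} said: {text}" for role, text in messages]
    return " ".join(parts)


class ToySummaryBuffer:
    """Keeps recent messages verbatim and folds older ones into a summary."""

    def __init__(self, max_words=12):
        self.max_words = max_words   # word count is a crude proxy for tokens
        self.summary = ""
        self.messages = []           # list of (role, text) tuples

    def _size(self):
        return sum(len(text.split()) for _, text in self.messages)

    def add(self, role, text):
        self.messages.append((role, text))
        pruned = []
        # Drop the oldest messages until the buffer fits the budget again.
        while self._size() > self.max_words and len(self.messages) > 1:
            pruned.append(self.messages.pop(0))
        if pruned:
            self.summary = summarize(self.summary, pruned)


buffer = ToySummaryBuffer(max_words=12)
buffer.add("human", "Hi, my name is Ivan.")
buffer.add("ai", "Hello Ivan! How can I assist you today?")
buffer.add("human", "Who won Serie A?")
print(buffer.summary)
print(buffer.messages)
```

With this small budget, the first greeting is moved into the summary while the two latest messages stay in the buffer unchanged.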

In [19]:
import pprint
text = '''

Italian football has a storied history that dates back to the late 19th century, with the sport gaining significant popularity in the early 20th century. The formation of the Italian Football Federation in 1898 marked the beginning of organized football competitions in Italy. The 1930s were a golden era for Italian football, with the national team, known as the Azzurri, achieving remarkable success on the international stage. Under the guidance of coach Vittorio Pozzo, Italy won the FIFA World Cup twice, first in 1934 on home soil and again in 1938 in France. Key players like Luigi and Mario played crucial roles in these triumphs, displaying exceptional skill and teamwork that cemented Italy's reputation as a football powerhouse.

The post-war period saw the Italian Serie A league become one of the most competitive and prestigious in the world, attracting top talents from around the globe. Clubs like Juventus, AC Milan, and Inter Milan dominated both domestically and in European competitions. The 1980s and 1990s witnessed Italian clubs winning numerous UEFA Champions League titles, with legendary players such as Paolo Maldini, Roberto Baggio, and Alessandro Del Piero becoming household names. The tactical discipline and defensive prowess known as "Catenaccio" became synonymous with Italian football, contributing to the national team's success in reaching the World Cup final in 1994, though they were narrowly defeated by Brazil.

The 21st century brought a new era for Italian football, with the national team achieving its fourth World Cup victory in 2006. Under the management of Marcello Lippi, the Azzurri displayed resilience and tactical acumen throughout the tournament held in Germany. Key players like Fabio Cannavaro, Gianluigi Buffon, and Andrea Pirlo were instrumental in Italy's success, leading the team to a memorable final against France, which was decided by a penalty shootout after a 1-1 draw. This victory not only reestablished Italy as a dominant force in world football but also inspired a new generation of players and fans. Despite facing challenges in subsequent years, Italian football continues to produce world-class talent and remains an integral part of the country's cultural and sporting identity.

'''

pprint.pprint(text)

('\n'
 '\n'
 'Italian football has a storied history that dates back to the late 19th '
 'century, with the sport gaining significant popularity in the early 20th '
 'century. The formation of the Italian Football Federation in 1898 marked the '
 'beginning of organized football competitions in Italy. The 1930s were a '
 'golden era for Italian football, with the national team, known as the '
 'Azzurri, achieving remarkable success on the international stage. Under the '
 'guidance of coach Vittorio Pozzo, Italy won the FIFA World Cup twice, first '
 'in 1934 on home soil and again in 1938 in France. Key players like Luigi and '
 'Mario played crucial roles in these triumphs, displaying exceptional skill '
 "and teamwork that cemented Italy's reputation as a football powerhouse.\n"
 '\n'
 'The post-war period saw the Italian Serie A league become one of the most '
 'competitive and prestigious in the world, attracting top talents from around '
 'the globe. Clubs like Juventus, AC Milan

In [20]:
from langchain.schema.document import Document

# Creating vectorstore

def load_index():
    # if you want to avoid the step of saving/loading a file, you can use the `from_documents()` method of the VectorstoreIndexCreator()
    #loader = CSVLoader(file_path='results_22-23.csv')
    #index = VectorstoreIndexCreator().from_loaders([loader])
    # Split on newlines (the escape is '\n', not '/n') and skip empty chunks
    docs = [Document(page_content=x) for x in text.split('\n') if x.strip()]
    index = VectorstoreIndexCreator().from_documents(docs)
    return index

retriever=load_index().vectorstore.as_retriever()
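
To give an intuition for what the retriever does, here is a toy sketch in plain Python: the text is split into paragraphs on the newline escape `\n`, each paragraph is turned into a vector, and the paragraph closest to the query is returned. This is only an assumption-laden illustration. The bag-of-words `embed` function stands in for the real embedding model, and `split_paragraphs`, `cosine`, and `retrieve` are hypothetical helpers, not LangChain APIs.

```python
import math
import re
from collections import Counter


def split_paragraphs(raw_text):
    """Split on newlines and drop empty chunks."""
    return [chunk.strip() for chunk in raw_text.split("\n") if chunk.strip()]


def embed(chunk):
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


sample = """
Italy won the FIFA World Cup in 1934 and 1938 under coach Vittorio Pozzo.

Italy won its fourth World Cup in 2006 under Marcello Lippi.
"""

chunks = split_paragraphs(sample)
best = retrieve("Who coached Italy in 1938?", chunks)[0]
print(len(chunks), best)
```

Each paragraph becomes its own document, so a question about the 1930s can be answered from the matching paragraph alone rather than from the whole text.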



In [21]:
# Import things that are needed generically
from langchain_openai import OpenAI
from langchain.pydantic_v1 import BaseModel, Field
from langchain.tools import BaseTool, StructuredTool, tool, format_tool_to_openai_function, Tool
from langchain.tools.retriever import create_retriever_tool
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent


pandas_agent = create_pandas_dataframe_agent(ChatOpenAI(temperature=0,model_name=model_name), df, verbose=True, handle_parsing_errors=True)

# Define the tool using the agent
pandas_tool = Tool(
    name="PandasDataFrameAgent",
    func=pandas_agent.run,
    description="A tool to query football statistics from a DataFrame regarding Serie A 2022/23 results"
)

retriever_tool = create_retriever_tool(
    retriever,
    "search_for_football_history",
    "Returns information regarding the history of football in Italy",
)

tools = [pandas_tool, retriever_tool]
formatted_functions = [format_tool_to_openai_function(t) for t in tools]



In [22]:
def load_llm():
    llm = ChatOpenAI(temperature=0,model_name=model_name)
    return llm

llm = load_llm()
llm_with_tools = llm.bind(functions=formatted_functions)

In [23]:
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_function_messages(
            x["intermediate_steps"]
        ),
    }
    | RunnablePassthrough.assign(
        **{MEMORY_KEY: RunnableLambda(memory.load_memory_variables) | itemgetter(MEMORY_KEY)}
    )
    | prompt
    | llm_with_tools
    | OpenAIFunctionsAgentOutputParser()
)

qa = AgentExecutor(agent=agent, tools=tools, memory=memory, verbose=verbose, handle_parsing_errors=True)
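
Conceptually, the executor runs a loop: the model either requests a tool or returns a final answer, and each tool result is fed back through the scratchpad. The sketch below mimics that loop in plain Python under strong assumptions. `fake_model`, `toy_tools`, and `run_agent` are hypothetical stand-ins, and the canned tool outputs replace the real pandas agent and retriever.

```python
def fake_model(question, scratchpad):
    """Hypothetical stand-in for the LLM: pick a tool first, then answer."""
    if not scratchpad:
        if "history" in question.lower():
            tool = "search_for_football_history"
        else:
            tool = "PandasDataFrameAgent"
        return {"tool": tool, "tool_input": question}
    _, observation = scratchpad[-1]
    return {"output": f"Based on the tool result: {observation}"}


# Toy tools keyed by the same names the notebook registers.
toy_tools = {
    "PandasDataFrameAgent": lambda q: "Napoli finished first with 90 points",
    "search_for_football_history": lambda q: "Italy won the World Cup in 1934 and 1938",
}


def run_agent(question, max_steps=3):
    """Loop: ask the model, run the requested tool, feed the result back."""
    scratchpad = []  # plays the role of `intermediate_steps`
    for _ in range(max_steps):
        decision = fake_model(question, scratchpad)
        if "output" in decision:          # the model produced a final answer
            return decision["output"]
        observation = toy_tools[decision["tool"]](decision["tool_input"])
        scratchpad.append((decision, observation))
    return "Stopped: step limit reached"


answer = run_agent("Who won Serie A in 2022/23?")
print(answer)
```

The step limit guards against a model that keeps requesting tools without ever producing a final answer.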

## Q&A
Let's now ask some questions about Serie A and the history of Italian football:

In [24]:
# Function to print the response for a given query
def print_response_for_query(query):
    result = qa.invoke({"input": query})['output']
    output_format = f"""===================================
    {result}
    ==================================="""
    return pprint.pprint(output_format)

### Asking questions

In [25]:
query = "Who became the runner-up of the Premier League in 2022/23?"
print_response_for_query(query)

 "    I'm here to help with Serie A results and the history of Italian "
 'football. For information about the Premier League, I recommend checking a '
 'reliable sports news website or the official Premier League website for the '
 'most accurate and up-to-date information. Is there anything else related to '
 'Serie A or Italian football history I can help you with?\n'


Correct response. What about Serie A?

In [26]:
query = "Sorry, I meant in Serie A"
print_response_for_query(query)



> Entering new AgentExecutor chain...
Thought: To find the runner-up for the 2022/23 season, I need to identify the team with the second highest number of points (`Pts`) in the dataframe `df`.

Action: python_repl_ast
Action Input: df.sort_values(by='Pts', ascending=False).iloc[1]['Team']
LLMia
I now know the final answer.
Final Answer: The runner-up for the 2022/23 season is LLMia.

> Finished chain.
 '    In the Serie A 2022/23 season, the runner-up was LLMia. If you have any '
 'more questions or need further information, feel free to ask!\n'


Did you notice how we used the **memory** here? Without it, the response would have sounded something like:
> Sorry, I can't understand you. What exactly do you want to know about Serie A?

In [27]:
query = "What place Lazio took?"
print_response_for_query(query)



> Entering new AgentExecutor chain...
Thought: The question seems to be asking for information about "Lazio" from the dataframe. I need to extract the row corresponding to "Lazio" and present its details.

Action: python_repl_ast
Action Input: df[df['Team'] == 'Lazio']
    Team  Pl   W  D  L   F   A  GD  Pts  Last 6
2  Lazio  38  22  8  8  60  30  30   74     NaN
I now know the final answer
Final Answer: For the team "Lazio", the details are as follows: Played (Pl) = 38, Wins (W) = 22, Draws (D) = 8, Losses (L) = 8, Goals For (F) = 60, Goals Against (A) = 30, Goal Difference (GD) = 30, Points (Pts) = 74, Last 6 games' points (Last 6) = NaN.

> Finished chain.
 '    In the Serie A 2022/23 season, Lazio finished in 3rd place with 74 '
 'points. They played 38 matches, winning 22, drawing 8, and losing 8. They '
 'scored 60 goals and conceded 30, resulting in a goal difference of +30. If '
 'you have any more questions or 

### Multichain example

In [28]:
df.head(3)

Unnamed: 0,Team,Pl,W,D,L,F,A,GD,Pts,Last 6
0,Napoli,38,28,6,4,77,28,49,90,
1,LLMia,38,24,6,8,48,20,28,78,
2,Lazio,38,22,8,8,60,30,30,74,


In [29]:
query = "Who scored the least amount of goals in the top 3?"
print_response_for_query(query)

 '    In the Serie A 2022/23 season, among the top 3 teams, Lazio scored the '
 'least amount of goals. They scored 60 goals. If you have any more questions '
 'or need further information, feel free to ask!\n'


In [30]:
query = "How many teams scored less?"
print_response_for_query(query)



> Entering new AgentExecutor chain...
Thought: The question seems to be asking for the total number of goals scored (goals for) by all teams in the dataframe. To find this, I need to sum the values in the 'F' column of the dataframe.

Action: python_repl_ast
Action Input: df['F'].sum()
1022
I now know the final answer.
Final Answer: The total number of goals scored (goals for) by all teams in the dataframe is 1022.

> Finished chain.


> Entering new AgentExecutor chain...
Thought: To find the teams with less than 60 goals, I need to filter the dataframe `df` where the column 'F' (which represents goals scored) is less than 60.

Action: python_repl_ast
Action Input: df[df['F'] < 60]['Team']
1           LLMia
6            Roma
7      Juventus *
8      Fiorentina
9         Bologna
10         Torino
11          Monza
12        Udinese
13       Sassuolo
14         Empoli
15    Salernitan

In [31]:
df[df['F'] < 48]

Unnamed: 0,Team,Pl,W,D,L,F,A,GD,Pts,Last 6
10,Torino,38,14,11,13,42,41,1,53,
12,Udinese,38,11,13,14,47,48,-1,46,
13,Sassuolo,38,12,9,17,47,61,-14,45,
14,Empoli,38,10,13,15,37,49,-12,43,
16,Lecce,38,8,12,18,33,46,-13,36,
17,Spezia,38,6,13,19,31,62,-31,31,
18,Verona,38,7,10,21,31,59,-28,31,
19,Cremonese,38,5,12,21,36,69,-33,27,
20,Sampdoria,38,3,10,25,24,71,-47,19,
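
Note that the direct check above filters on 48 rather than on the 60 goals quoted in the agent's answer: among the top three rows, LLMia has the fewest goals (48), not Lazio (60). A minimal sketch of that comparison in plain Python, with the three rows copied from `df.head(3)` (the `top_three` list is an illustrative name):

```python
# Top three rows copied from `df.head(3)`: (team, goals_for)
top_three = [("Napoli", 77), ("LLMia", 48), ("Lazio", 60)]

# Team with the fewest goals scored among the top three
lowest_team, lowest_goals = min(top_three, key=lambda row: row[1])
print(lowest_team, lowest_goals)
```

Cross-checking an agent's answer against the DataFrame like this is a cheap way to catch mistakes before they propagate into follow-up questions.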


### Some calculations

In [32]:
query = "What is the average goals scored for Juventus?"
print_response_for_query(query)



> Entering new AgentExecutor chain...
Thought: Since the dataframe `df` is not directly accessible, I cannot compute the average goals scored for Juventus directly. However, I can provide a general approach on how to calculate the average goals scored for a specific team using pandas in Python.
Final Answer: To calculate the average goals scored for Juventus (or any team) in a pandas dataframe, you can use the following approach:

```python
average_goals_scored = df[df['Team'] == 'Juventus']['F'].mean()
```

This code filters the dataframe for rows where the 'Team' column matches 'Juventus', then selects the 'F' column (which represents goals scored), and finally calculates the mean of that column. Note that this approach assumes you have a dataframe `df` with the same structure as the one you described.

> Finished chain.


> Entering new AgentExecutor chain...
Thought: The question "Juventus" is not clear in its context. It seem

Notice that the team name `Juventus *` contains a `*` symbol, which prevents an exact match on `Juventus`. Clean data leads to better results.
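
One way to clean the names before handing the data to the agent is to strip trailing footnote markers. A minimal sketch, assuming the only noise is trailing whitespace and asterisks (`clean_team_name` is a hypothetical helper, not part of the notebook):

```python
import re


def clean_team_name(name):
    """Remove trailing footnote markers such as ' *' from a team name."""
    return re.sub(r"[\s*]+$", "", name)


teams = ["Napoli", "Juventus *", "Inter Milan"]
cleaned = [clean_team_name(t) for t in teams]
print(cleaned)  # ['Napoli', 'Juventus', 'Inter Milan']
```

On the DataFrame, the same function could be applied with `df['Team'] = df['Team'].map(clean_team_name)`, after which an exact comparison with `'Juventus'` would find the row.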

In [33]:
df[df['Team'].str.contains('Juventus')]

Unnamed: 0,Team,Pl,W,D,L,F,A,GD,Pts,Last 6
7,Juventus *,38,22,6,10,56,33,23,62,


### Text retrieval

In [34]:
query = "Who were the most recognized players in Italian football in the 1930s?"
print_response_for_query(query)



 "    In the 1930s, Italian football was highlighted by the national team's "
 'remarkable achievements, particularly winning the FIFA World Cup in 1934 and '
 'again in 1938 under the guidance of coach Vittorio Pozzo. While the response '
 'provided does not specify individual players from the 1930s, it mentions the '
 'era as a golden period for Italian football, with players like Luigi and '
 'Mario being key figures in these triumphs. This might be a placeholder text, '
 'as specific names like Luigi and Mario in this context do not directly '
 'correlate to well-known players of that era.\n'
 '\n'
 'However, some of the most recognized Italian footballers from the 1930s who '
 "contributed to Italy's World Cup victories and were renowned for their skill "
 'include Giuseppe Meazza, who was one of the stars of the Italian national '
 "team, and Silvio Piola, another legendary figure. Meazza's contributions "
 'were pivotal in both World Cup victories, and he is considered one of th

In [35]:
query = "Who were the most recognized players in Italian football in the modern era?"
print_response_for_query(query)



 '    In the modern era of Italian football, several players have gained '
 'recognition for their exceptional skills, contributions to their clubs, and '
 'performances on the international stage. Here are some of the most '
 'celebrated Italian footballers from recent decades:\n'
 '\n'
 '- **Paolo Maldini**: A legendary figure for AC Milan and the Italian '
 'national team, Maldini is widely regarded as one of the greatest defenders '
 'in the history of football. His career spanned over two decades, during '
 'which he won numerous domestic and international titles.\n'
 '\n'
 '- **Roberto Baggio**: Known for his creativity, technical skill, and scoring '
 'ability, Baggio is considered one of the best Italian players of all time. '
 'He played for several top Italian clubs and was a key player for the '
 'national team, notably during the 1994 FIFA World Cup.\n'
 '\n'
 '- **Alessandro Del Piero**: Del Piero spent the majority of his career at '
 "Juventus, becoming the club's all-ti