# Data Formats

Data formats matter in data engineering because they determine how data is stored, exchanged, and read by downstream tools. Here are some common data formats:

## CSV (Comma-Separated Values)

CSV is a common format used for storing and exchanging tabular data. In Python, you can use the `csv` module to read and write CSV files. Here's an example of how to read a CSV file and print its contents:

In [None]:
import csv

with open('data.csv', 'r', newline='') as file:
    # For tab-separated (TSV) files, pass delimiter='\t' to csv.reader
    reader = csv.reader(file)
    for row in reader:
        print(row)
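
The paragraph above mentions writing as well as reading, so here is a minimal sketch of the write side using `csv.DictWriter`, followed by reading the file back with `csv.DictReader`. The sample rows, the column names, and the `people.csv` file name are assumptions made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

# Write into a temporary directory so the sketch does not overwrite real files.
path = Path(tempfile.mkdtemp()) / "people.csv"

# newline='' lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; DictReader uses the header row as dictionary keys.
with open(path, newline="", encoding="utf-8") as file:
    loaded = list(csv.DictReader(file))

print(loaded)
```

CSV carries no type information, so every value comes back as a string (the age `30` is read as `'30'`) and has to be converted explicitly if numbers are needed.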

## JSON (JavaScript Object Notation)

JSON is a lightweight data interchange format. In Python, you can use the `json` module to encode and decode JSON data. Here's an example of how to encode a Python dictionary as JSON and write it to a file:

In [None]:
import json

data = {
    "name": "John Smith",
    "age": 30,
    "is_valid": True
}

with open('data.json', 'w') as file:
    json.dump(data, file)
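
Decoding is the reverse step. The sketch below writes the same dictionary to a file, loads it back with `json.load`, and also shows the string-based pair `json.dumps` / `json.loads`. Writing to a temporary directory is a choice made here to keep the example from touching real files.

```python
import json
import tempfile
from pathlib import Path

data = {"name": "John Smith", "age": 30, "is_valid": True}

# Temporary location, used only to keep this sketch self-contained.
path = Path(tempfile.mkdtemp()) / "data.json"

# Encode: Python dict -> JSON text in a file (indent=2 makes it human-readable).
with open(path, "w", encoding="utf-8") as file:
    json.dump(data, file, indent=2)

# Decode: JSON text in a file -> Python dict.
with open(path, encoding="utf-8") as file:
    loaded = json.load(file)

# The same round trip with strings instead of files.
text = json.dumps(data)
from_text = json.loads(text)

print(loaded == data)
print(text)
```

Python's `True` is written as JSON `true`, and it is converted back to `True` when the data is loaded.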

## Other Data Formats

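- Parquet: a columnar binary format, well suited to analytical queries that read only some of the columns.
- Avro: a row-oriented binary format that stores a schema alongside the data.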

# Data Collection

Data collection involves gathering data from various sources, transforming it into a useful format, and storing it for further processing and analysis.

## Web Scraping

Web scraping is the process of extracting data from websites. In Python, you can use the `requests` package to download HTML pages and the `beautifulsoup4` package (imported as `bs4`) to parse them.

```bash
pip3 install requests beautifulsoup4
```

Here's an example of how to scrape a list of articles from a news website:

In [None]:
import requests
from bs4 import BeautifulSoup

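url = 'https://www.example.com/news'  # placeholder URL; replace with a real page
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500
soup = BeautifulSoup(response.text, 'html.parser')

# Assumes each story sits in an <article> tag containing an <h2>, an <a>, and a <p>
articles = []
for article in soup.find_all('article'):
    title = article.find('h2')
    link = article.find('a')
    summary = article.find('p')
    if title is None or link is None or summary is None:
        continue  # skip articles that do not match the expected layout
    articles.append({
        'title': title.get_text(strip=True),
        'link': link.get('href'),
        'summary': summary.get_text(strip=True),
    })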

print(articles)
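
Scraped `href` values are often relative, and the records usually need to be stored somewhere. As a rough sketch, the block below resolves links against the page URL with `urllib.parse.urljoin` and writes the result as CSV. The sample records and the `write_articles` helper are hypothetical; the records only mimic the shape of the `articles` list.

```python
import csv
import io
from urllib.parse import urljoin

base_url = "https://www.example.com/news"  # placeholder page URL

# Hypothetical scraped records with title, link, and summary keys.
articles = [
    {"title": "First headline", "link": "/news/1", "summary": "Short summary."},
    {"title": "Second headline", "link": "story-2.html", "summary": 'Commas, and "quotes".'},
]

# Turn relative hrefs into absolute URLs.
resolved = [{**a, "link": urljoin(base_url, a["link"])} for a in articles]


def write_articles(records, file):
    """Write article dicts as CSV rows to an open text file (hypothetical helper)."""
    writer = csv.DictWriter(file, fieldnames=["title", "link", "summary"])
    writer.writeheader()
    writer.writerows(records)


# An in-memory buffer stands in for a real file here; to write to disk,
# pass a file opened with open(..., 'w', newline='') instead.
buffer = io.StringIO()
write_articles(resolved, buffer)
print(buffer.getvalue())
```

The `csv` module quotes fields that contain commas or quote characters, so free-text summaries survive the round trip.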

## APIs

APIs are interfaces that allow you to access data from other applications or websites. In Python, you can use the `requests` module to make API requests and retrieve data in JSON format.

Here's an example of how to retrieve data from the OpenWeatherMap API:

In [None]:
import requests

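url = 'https://api.openweathermap.org/data/2.5/weather'
params = {'q': 'London,uk', 'appid': 'your_api_key'}  # replace with your own API key
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # raise an exception on HTTP errors (e.g. 401 for a bad key)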

data = response.json()
print(data)
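
Once the JSON is decoded, it is an ordinary Python dictionary. The sketch below shows one way to build the query string safely and to pull a few fields out of a response. The helper names, the `units` default, and the sample payload are assumptions for illustration; the field names (`name`, `main.temp`, `weather[0].description`) follow the shape of OpenWeatherMap's current-weather response as I understand it, so check them against a real response.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Build the request URL, letting urlencode escape special characters."""
    query = urlencode({"q": city, "appid": api_key, "units": units})
    return f"{BASE_URL}?{query}"


def summarize_weather(payload):
    """Pick a few fields from a decoded response, tolerating missing keys."""
    weather = payload.get("weather") or [{}]
    return {
        "city": payload.get("name"),
        "temperature": payload.get("main", {}).get("temp"),
        "description": weather[0].get("description"),
    }


# Made-up payload in the assumed response shape; not real weather data.
sample = {
    "name": "London",
    "main": {"temp": 11.5},
    "weather": [{"description": "light rain"}],
}

print(build_weather_url("London,uk", "your_api_key"))
print(summarize_weather(sample))
```

Using `.get` with defaults means a missing field yields `None` rather than a `KeyError`, which is convenient when an API omits optional fields.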