# Introduction to Data Science



## 1. What is Data Science?
Data Science is an **interdisciplinary field** that combines scientific methods, algorithms, and systems to extract meaningful insights, knowledge, and value from structured and unstructured data. It is essential for solving complex problems and making data-driven decisions.


<img src="https://jaro-website.s3.ap-south-1.amazonaws.com/2024/04/1-mgXvzNcwfpnBawI6XTkVRg.webp" alt="Interdisciplinary field" width="300" height="300" style="display: block; margin: 0 auto"/>


Click the summary below to expand the full list of course goals.


<details>
<summary>The course goals in this entire lecture:</summary>

1. **Extract Valuable Insights**
- Data science helps organizations and individuals analyze large amounts of data to identify meaningful patterns, trends, and relationships. These insights can lead to actionable business strategies, innovations, and improved decision-making.

2. **Make Data-Driven Decisions**
- One of the main goals of data science is to enable decision-makers to make more informed decisions based on real data rather than intuition or assumptions. By applying statistical models, machine learning, and data analysis, data science supports evidence-based decision-making in various domains such as business, healthcare, marketing, and more.

3. **Predictive Analytics**
- Data science aims to build models that can predict future outcomes or behaviors based on historical data. For example, it can predict customer behavior, market trends, equipment failures, or even the likelihood of diseases in healthcare. Predictive models like regression, classification, and time series analysis are often used.

4. **Automating Processes**
- Through techniques like machine learning and artificial intelligence, data science can help automate repetitive tasks, optimize workflows, and enhance the efficiency of processes. For instance, recommendation systems (like those used by Netflix and Amazon) are powered by data science to suggest relevant products or movies to users.

5. **Improve Operational Efficiency**
- Data science can identify inefficiencies and opportunities for improvement in business operations. For example, analyzing production processes, supply chain data, or customer service records can help organizations streamline their operations, reduce costs, and increase productivity.

6. **Personalization and Customization**
- In the fields of marketing, e-commerce, and entertainment, data science is used to personalize user experiences. By analyzing user behavior, preferences, and demographics, data science helps create customized content, recommendations, and targeted advertisements that resonate with individual users.

7. **Enhance Customer Experience**
- By analyzing customer data, data science helps companies understand customer preferences, needs, and pain points. This can lead to improved products, services, and customer support, ultimately enhancing the customer experience and fostering brand loyalty.

8. **Optimize Decision-Making**
- Data science helps organizations apply advanced analytics techniques like optimization and simulation to make better, more efficient decisions. This is particularly useful in resource management, finance, supply chain logistics, and strategy development.

9. **Support Scientific Research**
- In scientific fields like biology, physics, and social sciences, data science is used to analyze complex datasets and make discoveries. It helps researchers process large-scale experiments, run simulations, and model complex phenomena in ways that were not possible before.

10. **Uncover Hidden Patterns**
- Data science allows organizations to explore data and discover hidden patterns that may not be immediately obvious. For example, by applying clustering or anomaly detection techniques, data scientists can find new insights in customer behavior, financial transactions, or health data.

11. **Address Real-World Problems**
- Data science plays a crucial role in addressing global challenges. It can be used to solve real-world problems in healthcare (e.g., predicting disease outbreaks), climate science (e.g., modeling weather patterns), urban planning (e.g., optimizing traffic flows), and social sciences (e.g., understanding crime patterns or inequality).

12. **Support Policy and Decision-Making in Government**
- Data science is increasingly being used by governments and public sector organizations to analyze social trends, predict outcomes, and improve public services. By analyzing public data, policymakers can make decisions that benefit society, such as in areas like crime prevention, education, and healthcare.
</details>

## 2. The Key Components of Data Science

Data science can be divided into several key components:

- **1. Data Collection**: Gathering data from various sources such as databases, sensors, web scraping, and APIs.
- **2. Data Cleaning and Preprocessing**: Data often comes in raw forms that need to be cleaned and transformed into a usable format. This includes handling missing values, correcting errors, removing duplicates, and normalizing data.
- **3. Exploratory Data Analysis (EDA)**: Visualizing and analyzing data to identify patterns, trends, and relationships. This helps to formulate hypotheses and choose appropriate analysis techniques.
- **4. Statistical Analysis**: Applying statistical methods to infer relationships and make predictions. This includes hypothesis testing, regression analysis, and probability.
- **5. Machine Learning**: Building and training models using data to make predictions or automate tasks. Techniques include supervised learning (e.g., classification and regression), unsupervised learning (e.g., clustering), and reinforcement learning.
- **6. Data Visualization**: Creating charts, graphs, and interactive visualizations to communicate findings and insights effectively to stakeholders. Tools like Matplotlib, Seaborn, and Tableau are often used for this.
- **7. Big Data Technologies**: Using frameworks and tools like Hadoop, Spark, and cloud-based platforms to handle large and complex datasets that cannot be processed using traditional tools.
- **8. Domain Expertise**: Understanding the specific domain (e.g., healthcare, finance, marketing) to interpret data correctly and derive actionable insights.
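
A few of these components (cleaning, normalizing, and a simple regression) can be sketched in plain Python. This is only a minimal illustration: the records, the field names `hours` and `score`, and the helper functions are all hypothetical, and a real project would more likely use pandas and a machine learning library.

```python
from statistics import mean

# Hypothetical raw records: one exact duplicate and one missing value.
raw_records = [
    {"hours": 1.0, "score": 52.0},
    {"hours": 2.0, "score": 54.0},
    {"hours": 2.0, "score": 54.0},   # exact duplicate
    {"hours": 3.0, "score": None},   # missing value
    {"hours": 4.0, "score": 58.0},
]


def clean(records):
    """Drop rows with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in records:
        key = (row["hours"], row["score"])
        if None in key or key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


rows = clean(raw_records)
hours = [r["hours"] for r in rows]
scores = [r["score"] for r in rows]
slope, intercept = fit_line(hours, scores)

print(len(rows), "rows after cleaning")
print("normalized hours:", min_max_normalize(hours))
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
```

The cleaning step runs before the fit on purpose: a missing value or a duplicated row would otherwise distort the estimated slope.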

## 3. The Workflow of Data Science

The data science workflow is an iterative cycle of data processing steps.

<img alt="workflow of data science" src="https://www.sudeep.co/images/post_images/2018-02-09-Understanding-the-Data-Science-Lifecycle/chart.png" height="300" />

## 4. A Simple Example of Data Science

To start, we need to scrape or import the data. Here, I will use Python web scraping to fetch the search results from [bilibili for the search key "data science"](https://search.bilibili.com/all?keyword=data+science).

The libraries needed for web scraping can be installed with the pip command.

In [None]:
!pip install requests beautifulsoup4 lxml


In [None]:
import requests
from lxml import html

# Target URL (replace it with the page you want to scrape)
url = 'https://search.bilibili.com/all?keyword=data+science'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15'
}

# Send an HTTP GET request to fetch the HTML content
response = requests.get(url, headers=headers, timeout=10)

# Check whether the request succeeded (status code 200)
if response.status_code == 200:
    # Parse the page content with lxml
    tree = html.fromstring(response.content)

    # Extract the video title elements with XPath
    # (adjust the XPath if the page structure changes)
    titles = tree.xpath('//div[@id="i_cecream"]//a/h3')

    # Build a list of dictionaries, one per video title
    videos = []
    for title_element in titles:
        title = title_element.text_content().strip()  # title text
        videos.append({'title': title})

    # Print the result
    print(videos)
else:
    print(f"Request failed, status code: {response.status_code}")


[{'title': '[udemy] Python for Data Science and Machine Learning Bootcamp'}, {'title': '【数据科学入门】data_science'}, {'title': 'python数据科学（机器学习）持续更新'}, {'title': '约翰霍普金斯大学《数据科学：统计和机器学习|Data Science: Statistics and Machine Learning》中英字幕'}, {'title': '哈佛大学 CS109 数据科学 Data Science（2019）'}, {'title': 'Intro to Data Science  -- 1'}, {'title': '密歇根大学《Python用于数据科学实践（1-3课/共5课）|Applied Data Science with Python 》deepseek翻译'}, {'title': '吴恩达同步更新AI专业课，第53讲：Data Engineering 数据工程。吴恩达AI最新specialization 课程'}, {'title': '医保结算数据分析思路与实践'}, {'title': '【Data Science丨IBM数据科学】IBM官方课程，跟着数据科学家入门大数据，课程资料免费领取，还提供Coursera官方课程证书，证书2折优惠！'}, {'title': '谷歌数据分析师第一课《基础： 数据，数据，无处不在》foundations-data'}, {'title': 'Udemy - Complete Data Science,Machine Learning,DL,NLP Bootcamp part1'}, {'title': '[Coursera公开课] 用于数据科学的SQL SQL for Data Science'}, {'title': '【卷王专业】之美国数据科学Data Science硕士申请你不得不知道的事！【采访视频】'}, {'title': 'UCB《数据100数据科学原理与技术|Data 100 Principles and Techniques of Data Science 24SP》中英'}, {'title': 'Data Science Lectures'},
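
Once the titles are collected, a quick exploratory step is to count how often certain keywords appear. The sketch below uses a handful of titles copied from the output above; the keyword list is a hypothetical choice made for illustration, and matching is a simple case-insensitive substring test.

```python
from collections import Counter

# A few titles copied from the output above (the full list is longer).
videos = [
    {"title": "[udemy] Python for Data Science and Machine Learning Bootcamp"},
    {"title": "【数据科学入门】data_science"},
    {"title": "python数据科学（机器学习）持续更新"},
    {"title": "哈佛大学 CS109 数据科学 Data Science（2019）"},
    {"title": "Intro to Data Science  -- 1"},
    {"title": "医保结算数据分析思路与实践"},
]

# Hypothetical keywords of interest; matching is case-insensitive.
keywords = ["data science", "数据科学", "python", "machine learning"]

counts = Counter()
for video in videos:
    title = video["title"].lower()
    for keyword in keywords:
        if keyword in title:
            counts[keyword] += 1

# Print keywords from most to least frequent
for keyword, n in counts.most_common():
    print(f"{keyword}: {n}")
```

Note that a substring test misses variants such as `data_science`, so real keyword analysis usually needs some text normalization first.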

In [None]:
!pip install pandas openpyxl



In [None]:
import os
import re

import pandas as pd
import requests

print(os.getcwd())

url = "https://movie.douban.com/top250"

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.20 Safari/537.36"
}

# Send the request and read the page content
response = requests.get(url, headers=header)
page_content = response.text
response.close()

# Regular expression that captures each movie's name, year, score, and number of ratings
obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?<br>(?P<time>.*?)&nbsp.*?'
                 r'<span class="rating_num" property="v:average">(?P<score>.*?)</span>.*?<span>(?P<num>.*?)</span>', re.S)

# Iterator over all matches in the page
result = obj.finditer(page_content)

# Convert the matches into a list of dictionaries
data = [match.groupdict() for match in result]

# Save the data as an Excel file
df = pd.DataFrame(data)  # convert the list into a DataFrame
print(df)
output_file = "豆瓣data.xlsx"
df.to_excel(output_file, index=False)  # written to the current working directory

# Confirm that the file was written
if os.path.exists(output_file):
    print("File saved!")
else:
    print("File not saved!")

print("over!")

/content
        name                                time score         num
0       星际穿越  \n                            2014   9.4  2046072人评价
1       盗梦空间  \n                            2010   9.4  2224412人评价
2    忠犬八公的故事  \n                            2009   9.4  1485428人评价
3    三傻大闹宝莱坞  \n                            2009   9.2  1991205人评价
4     放牛班的春天  \n                            2004   9.3  1412579人评价
5      疯狂动物城  \n                            2016   9.2  2132192人评价
6     机器人总动员  \n                            2008   9.3  1418664人评价
7         熔炉  \n                            2011   9.3   997872人评价
8       触不可及  \n                            2011   9.3  1227233人评价
9      寻梦环游记  \n                            2017   9.1  1856260人评价
10    当幸福来敲门  \n                            2006   9.2  1624083人评价
11      怦然心动  \n                            2010   9.1  1967667人评价
12  蝙蝠侠：黑暗骑士  \n                            2008   9.2  1144004人评价
13     我不是药神  \n                            2018   9.
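
The scraped fields are still raw strings: `time` carries a leading newline and padding, and `num` ends with the suffix 人评价. A minimal cleaning sketch follows, using two rows copied from the output above. The `clean_row` helper and the output keys `year` and `votes` are hypothetical names chosen for illustration, and the sketch assumes each `time` value contains a four-digit year.

```python
import re

# Two rows copied from the output above, in the same shape as `data`.
data = [
    {"name": "星际穿越", "time": "\n                            2014",
     "score": "9.4", "num": "2046072人评价"},
    {"name": "盗梦空间", "time": "\n                            2010",
     "score": "9.4", "num": "2224412人评价"},
]


def clean_row(row):
    """Return a copy of a scraped row with typed, whitespace-free fields."""
    year = re.search(r"\d{4}", row["time"])   # first four-digit number
    votes = re.search(r"\d+", row["num"])     # leading rating count
    return {
        "name": row["name"].strip(),
        "year": int(year.group()) if year else None,
        "score": float(row["score"]),
        "votes": int(votes.group()) if votes else None,
    }


cleaned = [clean_row(row) for row in data]
for row in cleaned:
    print(row)
```

With numeric types in place, the rows can be sorted, filtered, or aggregated directly, for example after passing `cleaned` to `pd.DataFrame`.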

Download the file from the Google Colab virtual environment:

In [67]:
from google.colab import files

# Download the file
files.download("/content/豆瓣data.xlsx")
